Introduction to Data Science Assignment 2023/2024¶

Professors: Pedro G. Ferreira, Alípio Jorge

Group Students: Gonçalo Rocha (up201707455), Manuela Pinheiro (up200400020), Sofia Coelho (up202103646)

In [1]:
import numpy as np
import seaborn as sns
import pylab as pyl
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sklearn import tree
from boruta import BorutaPy
from sklearn.svm import SVC
from scipy.stats import zscore
from sklearn.impute import KNNImputer
from mixed_naive_bayes import MixedNB
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import RFE
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve, auc
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

1. Business understanding¶

In this work project we propose to develop and deploy a predictive model for customer churn prediction in the telecommunications industry to proactively identify potential churners, allowing the company to implement targeted retention strategies and reduce customer attrition, following the CRISP-DM methodology.

1.1 -The first step is to understand the business problem and define success criteria:

The Business problem: The telecommunications company wants to make better business decisions to decrease the rate at which customers cease their subscriptions. This factor is critical to the company since, first, it reflects the customer satisfaction and, second, has a big impact on the company´s revenues. Identifying the factors and the patterns that lead to that action is crucial to 1) try to change/optimize those factors and 2) re-enforce retention efforts on those clients. Implementing targeted retention efforts for potential churners is expected to be more cost-effective than acquiring new customers.

The Business criteria: Reducing costs by increasing customer retention.

1.2 -Then, we should translate it to a machine learning problem:

The Machine Learning problem: A classification problem: given a dataset of different variables of a client finding a classification model that can predict if the client is going to cease their subscription or not.

The Machine Learning criteria: Prediction Accuracy: Achieve a high accuracy rate in predicting customer churn (that is specifically good at predicting the class Churn = yes) Precision and Recall: Balance precision and recall to effectively identify potential churners without excessive false positives. Model Robustness: Develop a robust model that generalizes well to new data and changing patterns.

1.3 - The key questions that we want to explore during the project are:

What factors contribute most to customer churn in the telecommunications industry? How well can we predict customer churn based on the available data? What features are most indicative of potential churners? How can the company effectively allocate resources to retain customers and reduce churn?

2. Data Understanding¶

2.1 Explore the data¶

The first step in data understanding is to explore the structure and format of the data, identify the type of variables and examine the size and shape of the dataset.

In [2]:
data = pd.read_csv("churn_data.csv")
data
Out[2]:
churn accountlength internationalplan voicemailplan numbervmailmessages totaldayminutes totaldaycalls totaldaycharge totaleveminutes totalevecalls totalevecharge totalnightminutes totalnightcalls totalnightcharge totalintlminutes totalintlcalls totalintlcharge numbercustomerservicecalls
0 No 128.0 no yes 25.0 265.1 110.0 45.07 197.4 99.0 16.78 244.7 91.0 11.01 10.0 3.0 2.70 1.0
1 No 107.0 no yes 26.0 161.6 123.0 27.47 195.5 103.0 16.62 254.4 103.0 11.45 13.7 3.0 3.70 1.0
2 No 137.0 no no 0.0 243.4 NaN 41.38 121.2 110.0 10.30 162.6 104.0 7.32 12.2 5.0 3.29 0.0
3 No 84.0 yes no 0.0 299.4 71.0 50.90 61.9 88.0 5.26 196.9 89.0 8.86 6.6 7.0 1.78 2.0
4 No 75.0 yes no 0.0 166.7 113.0 28.34 148.3 122.0 12.61 186.9 121.0 8.41 10.1 3.0 2.73 3.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4995 No 50.0 no yes 40.0 235.7 127.0 40.07 223.0 126.0 18.96 297.5 116.0 13.39 9.9 5.0 2.67 2.0
4996 Yes 152.0 no no 0.0 184.2 90.0 31.31 256.8 73.0 21.83 213.6 113.0 9.61 14.7 2.0 3.97 3.0
4997 No 61.0 no no 0.0 140.6 89.0 23.90 172.8 128.0 14.69 212.4 97.0 9.56 13.6 4.0 3.67 1.0
4998 No 109.0 no no 0.0 188.8 67.0 32.10 171.7 92.0 14.59 224.4 89.0 NaN 8.5 6.0 2.30 0.0
4999 No 86.0 no yes 34.0 129.4 102.0 22.00 267.1 104.0 22.70 154.8 100.0 6.97 9.3 16.0 2.51 0.0

5000 rows × 18 columns

In [3]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 18 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   churn                       5000 non-null   object 
 1   accountlength               4951 non-null   float64
 2   internationalplan           4950 non-null   object 
 3   voicemailplan               4950 non-null   object 
 4   numbervmailmessages         4950 non-null   float64
 5   totaldayminutes             4950 non-null   float64
 6   totaldaycalls               4950 non-null   float64
 7   totaldaycharge              4950 non-null   float64
 8   totaleveminutes             4950 non-null   float64
 9   totalevecalls               4950 non-null   float64
 10  totalevecharge              4950 non-null   float64
 11  totalnightminutes           4950 non-null   float64
 12  totalnightcalls             4950 non-null   float64
 13  totalnightcharge            4950 non-null   float64
 14  totalintlminutes            4950 non-null   float64
 15  totalintlcalls              4950 non-null   float64
 16  totalintlcharge             4950 non-null   float64
 17  numbercustomerservicecalls  4950 non-null   float64
dtypes: float64(15), object(3)
memory usage: 703.3+ KB
  • Churn: Binary Variable ("No" or "Yes")
  • Account Lenght: Continuos Variable, Ratio-scaled Variable
  • International Plan: Binary Variable ("No" or "Yes")
  • Voicemail Plan: Binary Variable ("No" or "Yes")
  • Number of Email Messages: Continuos Variable, Ratio-scaled Variable
  • Total Day Minutes: Continuos Variable, Ratio-scaled Variable
  • Total Day Calls: Discrete Numerical Variable, Ratio-scaled Variable
  • Total Day Charge: Continuous Variable, Ratio-scaled Variable
  • Total Evening Minutes: Continuous Variable, Ratio-scaled Variable
  • Total Evening Calls: Discrete Numerical Variable, Ratio-scaled Variable
  • Total Evening Charge: Continuous Variable, Ratio-scaled Variable
  • Total Night Minutes: Continuous Variable, Ratio-scaled Variable
  • Total Night Calls: Discrete Numerical Variable, Ratio-scaled Variable
  • Total Night Charge: Continuous Variable, Ratio-scaled Variable
  • Total International Minutes: Continuous Variable, Ratio-scaled Variable
  • Total International Calls: Discrete Numerical Variable, Ratio-scaled Variable
  • Total International Charge: Continuous Variable, Ratio-scaled Variable
  • Number Custumer Service Calls: Discrete Numerical Variable, Ratio-scaled Variable

2.2 Detecting duplicates and missing values¶

In [4]:
print(str(data.duplicated().value_counts()))
False    5000
Name: count, dtype: int64
In [5]:
M = data.isna().sum()
M
Out[5]:
churn                          0
accountlength                 49
internationalplan             50
voicemailplan                 50
numbervmailmessages           50
totaldayminutes               50
totaldaycalls                 50
totaldaycharge                50
totaleveminutes               50
totalevecalls                 50
totalevecharge                50
totalnightminutes             50
totalnightcalls               50
totalnightcharge              50
totalintlminutes              50
totalintlcalls                50
totalintlcharge               50
numbercustomerservicecalls    50
dtype: int64
In [6]:
M.sum()
Out[6]:
849
In [7]:
data.count()/data.shape[0]
Out[7]:
churn                         1.0000
accountlength                 0.9902
internationalplan             0.9900
voicemailplan                 0.9900
numbervmailmessages           0.9900
totaldayminutes               0.9900
totaldaycalls                 0.9900
totaldaycharge                0.9900
totaleveminutes               0.9900
totalevecalls                 0.9900
totalevecharge                0.9900
totalnightminutes             0.9900
totalnightcalls               0.9900
totalnightcharge              0.9900
totalintlminutes              0.9900
totalintlcalls                0.9900
totalintlcharge               0.9900
numbercustomerservicecalls    0.9900
dtype: float64
In [8]:
missing_values = data[data.isnull().any(axis=1)]
missing_values
Out[8]:
churn accountlength internationalplan voicemailplan numbervmailmessages totaldayminutes totaldaycalls totaldaycharge totaleveminutes totalevecalls totalevecharge totalnightminutes totalnightcalls totalnightcharge totalintlminutes totalintlcalls totalintlcharge numbercustomerservicecalls
2 No 137.0 no no 0.0 243.4 NaN 41.38 121.2 110.0 10.30 162.6 104.0 7.32 12.2 5.0 3.29 0.0
19 No 73.0 no no 0.0 224.4 NaN 38.15 159.5 88.0 13.56 192.8 74.0 8.68 13.0 2.0 3.51 1.0
20 No 147.0 no no NaN 155.1 117.0 26.37 239.7 93.0 20.37 208.8 133.0 9.40 10.6 4.0 2.86 0.0
24 No NaN no no 0.0 81.1 86.0 13.79 245.2 72.0 20.84 237.0 115.0 10.67 10.3 2.0 2.78 0.0
29 No NaN no no 0.0 119.3 117.0 20.28 215.1 109.0 18.28 178.7 90.0 8.04 11.1 1.0 3.00 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4980 Yes 73.0 no no 0.0 177.2 NaN 30.12 270.5 84.0 22.99 241.8 112.0 10.88 12.3 2.0 3.32 3.0
4988 No NaN no no 0.0 157.0 101.0 26.69 208.8 127.0 17.75 113.3 109.0 5.10 16.2 2.0 4.37 2.0
4992 No 83.0 no no NaN 188.3 70.0 32.01 243.8 88.0 20.72 213.7 79.0 9.62 10.3 6.0 2.78 0.0
4993 No 73.0 no no 0.0 177.9 89.0 30.24 131.2 82.0 11.15 NaN 89.0 8.38 11.5 6.0 3.11 3.0
4998 No 109.0 no no 0.0 188.8 67.0 32.10 171.7 92.0 14.59 224.4 89.0 NaN 8.5 6.0 2.30 0.0

793 rows × 18 columns

From this information, we can extrapolate the number of missing values each variables has, as well as the number of rows that contain missing values. As the number of rows containing missing values is smaller than the sum of the missing values from all of the variables, we can conclude that there are rows of data containing more than one missing value.

2.3 Exploring Continuous / Numeric Variables¶

A statistical summary of each variable gives us important information about the spread of the data. This provides information regarding the minimum and maximum values (range), the first and third quartiles (25th and 75th percentiles), which allows for the calculation of the Inter Quarile Range (IQR), the mean and the median (50th percentile). Some of this information can also be represented using boxplots, which are a standardized way of displaying the dataset based on the statistical summary (the minimum, the maximum, the sample median, and the first and third quartiles). This helps us to visualize the data distribution, for a better understanding.

In [9]:
data.describe()
Out[9]:
accountlength numbervmailmessages totaldayminutes totaldaycalls totaldaycharge totaleveminutes totalevecalls totalevecharge totalnightminutes totalnightcalls totalnightcharge totalintlminutes totalintlcalls totalintlcharge numbercustomerservicecalls
count 4951.000000 4950.000000 4950.000000 4950.000000 4950.000000 4950.000000 4950.000000 4950.000000 4950.000000 4950.000000 4950.000000 4950.000000 4950.000000 4950.000000 4950.000000
mean 100.238295 7.763636 180.306625 100.038788 30.629386 200.679798 100.243838 17.048293 200.465697 99.932929 9.015240 10.259010 4.432525 2.772088 1.569091
std 39.718817 13.552928 53.926625 19.844529 9.148881 50.486434 19.837380 4.300503 50.498290 19.939450 2.276812 2.763712 2.448457 0.744552 1.305279
min 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 73.000000 0.000000 143.700000 87.000000 24.430000 166.425000 87.000000 14.130000 167.000000 87.000000 7.510000 8.500000 3.000000 2.300000 1.000000
50% 100.000000 0.000000 180.100000 100.000000 30.600000 201.000000 101.000000 17.090000 200.550000 100.000000 9.010000 10.300000 4.000000 2.780000 1.000000
75% 127.000000 17.000000 216.200000 113.000000 36.720000 234.100000 114.000000 19.897500 234.700000 113.000000 10.560000 12.000000 6.000000 3.240000 2.000000
max 243.000000 52.000000 351.500000 165.000000 59.760000 363.700000 170.000000 30.910000 395.000000 175.000000 17.770000 20.000000 20.000000 5.400000 9.000000

Similarly to boxplots, histograms and density plots are a way to visually represent our data. Histograms are graphs that shows the frequency of numerical data using rectangles. The height of a rectangle represents the distribution frequency of a variable. It helps to easily identify which variables have a higher frequency.

In [10]:
fig, ax = plt.subplots(3, 5, figsize=(15, 10))
variables = ['accountlength', 'numbervmailmessages', 'totaldayminutes', 'totaldaycalls', 'totaldaycharge']
indice = 0
for i in variables: 
    data[i].plot.box(ax=ax[0, indice])
    data[i].plot.density(ax=ax[1, indice], legend=True)
    data[i].plot.hist(ax=ax[2, indice], label=i, bins=10, legend=True)
    indice += 1
    
plt.tight_layout()
No description has been provided for this image
In [11]:
fig, ax = plt.subplots(3, 6, figsize=(15, 10))
variables = ['totaleveminutes', 'totalevecalls', 'totalevecharge', 'totalnightminutes' , 'totalnightcalls', 'totalnightcharge']
indice = 0
for i in variables: 
    data[i].plot.box(ax=ax[0, indice])
    data[i].plot.density(ax=ax[1, indice], legend=True)
    data[i].plot.hist(ax=ax[2, indice], label=i, bins=10, legend=True)
    indice += 1
    
plt.tight_layout()
No description has been provided for this image
In [12]:
fig, ax = plt.subplots(3, 4, figsize=(15, 10))
variables = ['totalintlminutes','totalintlcalls', 'totalintlcharge', 'numbercustomerservicecalls']
indice = 0
for i in variables: 
    data[i].plot.box(ax=ax[0, indice])
    data[i].plot.density(ax=ax[1, indice], legend=True)
    data[i].plot.hist(ax=ax[2, indice], label=i, bins=10, legend=True)
    indice += 1
    
plt.tight_layout()
No description has been provided for this image

Most of our variables closely follow a normal distribution, however, some variables, namely total international calls, and number of customer service calls, tend to skew to the left. This makes sense, given that these types of calls are not made often. Interestingly, the number of email messages reaches a peak frequency at a lower number, which means that most clients tend to send a small number of emails, but it follows a normal distribution with low frequencies at the higher number of emails sent.

Density plots can be seen as an extension of the histogram, however, the density plot can smooth out the distribution of values and reduce the noise. It visualizes the distribution of data over a given period, and the peaks show where values are concentratd.

2.4 Exploring Nominal and Binary Variables¶

The dataset has 3 binary variables, including the target label "churn". To visually analyse the distribution of nominal variables we used histograms.

In [13]:
variables = ['churn', 'internationalplan', 'voicemailplan']
fig, ax = plt.subplots(1, 3, figsize=(15, 5))

titles = ['Churn', 'International Plan', 'Voicemail Plan']

for i, variable in enumerate(variables):
    data[variable].value_counts().plot(kind='bar', ax=ax[i])
    ax[i].set_title(titles[i]) 

plt.tight_layout()
No description has been provided for this image

From the information we can see that all of the variables tend do lean to the "No" value. This is particularlly interesting in the "Churn" variable, since it shows that the majority of the clients choose to maintain their subscription.

2.5 - Detecting Outliers¶

Checking for the presence of outliers in the data is crucial because they can disproportionately influence the fitting of statistical models and lead to overfitting.

2.5.1 - Interquartile Range method¶

To identify potential outliers in our dataset we used the Interquartile Range (IQR) method, as is a robust method and less sensitive to extreme values.

Outliers may indicate errors in data collection, data entry, or other data preprocessing steps, but they also may represent interesting patterns or anomalies in the data. The representation of outlier limits in box plots offers a clear and intuitive way to identify, assess, and understand outliers in a dataset. It aids in visually assessing the distribution of data and provides a basis for further exploration and decision-making regarding the treatment of outliers in statistical analyses.

In [14]:
variables = ['accountlength', 'numbervmailmessages', 'totaldayminutes', 'totaldaycalls', 'totaldaycharge',
                   'totaleveminutes', 'totalevecalls', 'totalevecharge', 'totalnightminutes', 'totalnightcalls',
                   'totalnightcharge', 'totalintlminutes', 'totalintlcalls', 'totalintlcharge',
                   'numbercustomerservicecalls']

num_rows = 3 
num_cols = 5 
fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(15, 15))

for i, variable in enumerate(variables):
    row = i // num_cols  
    col = i % num_cols  
    ax = data[[variable]].boxplot(ax=axes[row, col])
    q3 = data[variable].quantile(0.75)
    q1 = data[variable].quantile(0.25)
    iqr = q3 - q1
    topcut = q3 + 1.5 * iqr
    lowcut = q1 - 1.5 * iqr
    ax.axhline(topcut, c='red', ls='--')
    ax.axhline(lowcut, c='orange', ls='--')


plt.tight_layout()
No description has been provided for this image

Visually inspecting the above box plots, we observe that the 'account length', 'number of voice mail messages', 'total international calls', and 'number of customer service calls' variables only present outliers on the upper boundary. This could indicate that the data presents a positively skewed distribution, which is in accordance with the previous data distribution plots obtained for the 'number of voice mail messages', 'total international calls', and 'number of customer service calls' variables.

2.5.2 - Common outliers identification using IQR, Z-score and MAD methods¶

Additionally to the IQR method we also used the Z-score and Median Absolute Deviation (MAD) methods to identify potential outliers.

The Z-score measures how many standard deviations a particular data point is from the mean of the dataset and values with Z-scores greater than a certain threshold (we used 3) are flagged as outliers.

MAD is a measure of statistical dispersion that uses the median of absolute deviations from the median and outliers are determined using a threshold (we used 3).

Similar to the IQR method, MAD is a robust method and less sensitive to extreme values, as they use median-based measures. The Z-score can be influenced by extreme values, especially in datasets with non-normal distributions.

For all the variables that presented a normal distribution we identified the common outliers for the three methods, while for the variables with a skewed distribution we only used the IQR and MAD methods.

In [15]:
variables = ['accountlength', 'totaldayminutes', 'totaldaycalls', 'totaldaycharge',
             'totaleveminutes', 'totalevecalls', 'totalevecharge', 'totalnightminutes', 'totalnightcalls',
             'totalnightcharge', 'totalintlminutes', 'totalintlcharge']

def detect_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data < lower_bound) | (data > upper_bound)]

def detect_outliers_mad(data, threshold=3):
    median = data.median()
    abs_deviations = np.abs(data - median)
    mad = abs_deviations.median()
    return data[abs_deviations > threshold * mad]

outliers_iqr = {}
outliers_mad = {}
outliers_z = {}

for var in variables:
    outliers_iqr[var] = set(detect_outliers_iqr(data[var]).index)
    outliers_mad[var] = set(detect_outliers_mad(data[var]).index)
    
    mean = data[var].mean()
    std_dev = data[var].std()
    z_scores = (data[var] - mean) / std_dev
    outliers_z[var] = set(data[abs(z_scores) > 3].index)

common_outliers = {}
for var in variables:
    common_outliers[var] = outliers_iqr[var].intersection(outliers_mad[var], outliers_z[var])

for var in variables:
    print(f"Common outliers for '{var}':")
    print(common_outliers[var])
    print(f"Number of common outliers for '{var}': {len(common_outliers[var])}")
    print("\n")
Common outliers for 'accountlength':
{416, 1408, 4260, 4389, 4395, 1551, 3216, 817, 1886, 1751, 4379, 4798}
Number of common outliers for 'accountlength': 12


Common outliers for 'totaldayminutes':
{1345, 1986, 2594, 2753, 2252, 365, 1052, 2736, 1397, 4981, 985, 3993, 4892}
Number of common outliers for 'totaldayminutes': 13


Common outliers for 'totaldaycalls':
{1345, 1121, 740, 1989, 4049, 4562, 3187, 692, 468, 1397, 1460, 1144, 4155, 4831}
Number of common outliers for 'totaldaycalls': 14


Common outliers for 'totaldaycharge':
{1345, 1986, 2594, 2753, 2252, 365, 1052, 2736, 1397, 4981, 985, 3993, 4892}
Number of common outliers for 'totaldaycharge': 13


Common outliers for 'totaleveminutes':
{32, 2551, 1960, 3996, 3370, 2732, 1233, 2932, 821, 3989, 4375, 4885, 889, 2331, 4892}
Number of common outliers for 'totaleveminutes': 15


Common outliers for 'totalevecalls':
{960, 3940, 4741, 646, 301, 1615, 3219, 2932, 58}
Number of common outliers for 'totalevecalls': 9


Common outliers for 'totalevecharge':
{32, 2551, 3996, 3370, 2732, 1233, 2932, 821, 533, 4375, 3989, 889, 4885, 2331, 4892}
Number of common outliers for 'totalevecharge': 15


Common outliers for 'totalnightminutes':
{3107, 4707, 1317, 1445, 2663, 1260, 3247, 3376, 2321, 883, 244, 3060, 1238, 4727, 1113, 922, 3899, 3807}
Number of common outliers for 'totalnightminutes': 18


Common outliers for 'totalnightcalls':
{2659, 4707, 3560, 3211, 2988, 493, 3895, 3535, 2288, 4207, 3570, 2903, 3799, 4410, 4508}
Number of common outliers for 'totalnightcalls': 15


Common outliers for 'totalnightcharge':
{3107, 4707, 1317, 1445, 2663, 1260, 3247, 3376, 2321, 883, 244, 3060, 1238, 4727, 1113, 922, 3899, 3807}
Number of common outliers for 'totalnightcharge': 18


Common outliers for 'totalintlminutes':
{1028, 4371, 4504, 1564, 2345, 4396, 2733, 179, 3891, 1080, 2362, 314, 3906, 4935, 712, 3400, 3275, 3403, 3916, 3663, 2513, 595, 4949, 343, 2906, 3290, 4451, 2918, 488, 2669, 878, 3568, 4976, 115, 1400, 762}
Number of common outliers for 'totalintlminutes': 36


Common outliers for 'totalintlcharge':
{1028, 4371, 4504, 1564, 2345, 4396, 2733, 179, 3891, 1080, 2362, 314, 3906, 4935, 712, 3400, 3275, 3403, 3916, 3663, 2513, 595, 4949, 343, 2906, 3290, 4451, 2918, 488, 2669, 878, 3568, 4976, 115, 1400, 762}
Number of common outliers for 'totalintlcharge': 36


In [16]:
variables = ['numbervmailmessages', 'totalintlcalls', 'numbercustomerservicecalls']

def detect_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data < lower_bound) | (data > upper_bound)]

def detect_outliers_mad(data, threshold=3):
    median = data.median()
    abs_deviations = np.abs(data - median)
    mad = abs_deviations.median()
    return data[abs_deviations > threshold * mad]

outliers_iqr = {}
outliers_mad = {}

for var in variables:
    outliers_iqr[var] = set(detect_outliers_iqr(data[var]).index)
    outliers_mad[var] = set(detect_outliers_mad(data[var]).index)
    
common_outliers = {}
for var in variables:
    common_outliers[var] = outliers_iqr[var].intersection(outliers_mad[var])

for var in variables:
    print(f"Common outliers for '{var}':")
    print(common_outliers[var])
    print(f"Number of common outliers for '{var}': {len(common_outliers[var])}")
    print("\n")
Common outliers for 'numbervmailmessages':
{1797, 1285, 2570, 268, 3342, 4369, 4625, 277, 149, 790, 4502, 2457, 2716, 1441, 4516, 423, 4777, 4778, 1454, 3246, 2608, 4275, 1846, 1596, 2366, 1602, 3523, 1732, 3524, 3910, 71, 2887, 4933, 845, 1487, 2768, 3279, 3154, 599, 3543, 4824, 4059, 3165, 4445, 1378, 4706, 3174, 615, 872, 3433, 4713, 4588, 4717, 3826, 4338, 1908, 1269, 3577, 4859, 2686}
Number of common outliers for 'numbervmailmessages': 60


Common outliers for 'totalintlcalls':
{514, 3587, 2562, 4108, 2576, 22, 3097, 4126, 1567, 4641, 3109, 4134, 41, 4652, 1581, 3636, 3128, 2621, 1092, 4679, 588, 4684, 3664, 4693, 4184, 2156, 4726, 1657, 636, 642, 3206, 2703, 1168, 3727, 153, 1179, 3230, 674, 2212, 2724, 182, 185, 698, 4286, 4292, 211, 723, 2775, 3800, 219, 3291, 3310, 3826, 756, 4857, 250, 3839, 3840, 1797, 2826, 4877, 272, 2835, 4386, 1832, 1333, 4917, 4408, 3386, 2883, 837, 329, 842, 1355, 341, 854, 3418, 347, 4955, 863, 1889, 3938, 1392, 4465, 2930, 883, 377, 3454, 2947, 3973, 4999, 1419, 2956, 911, 3991, 921, 2970, 420, 1960, 3516, 957, 3536, 2001, 3025, 3541, 982, 4568, 474, 2018, 483, 1526, 504, 1021, 3071}
Number of common outliers for 'totalintlcalls': 114


Common outliers for 'numbercustomerservicecalls':
{3585, 1538, 3079, 3081, 522, 3595, 1038, 4110, 4114, 4628, 21, 4631, 542, 2592, 4130, 547, 4132, 4644, 3112, 48, 54, 3131, 3643, 4670, 3140, 3655, 3144, 588, 4687, 3665, 4692, 2646, 1121, 1638, 3692, 1133, 3181, 1142, 3190, 635, 1150, 1662, 4740, 3717, 136, 1673, 655, 3228, 1694, 4772, 1193, 2218, 1707, 2732, 3243, 4778, 2223, 1713, 181, 694, 4790, 2248, 721, 4307, 1241, 4315, 1246, 736, 2785, 2786, 742, 235, 2283, 1273, 4346, 3323, 771, 1284, 778, 1802, 2827, 3851, 2322, 2327, 293, 1831, 1325, 4406, 4919, 1851, 4417, 841, 1865, 332, 2380, 3918, 2387, 3413, 2903, 3421, 2402, 874, 2415, 4982, 1912, 2428, 1919, 1407, 3973, 902, 392, 2952, 2953, 4489, 908, 2444, 2958, 911, 2960, 2961, 3468, 3983, 3487, 2979, 1970, 4018, 1973, 1974, 3509, 4023, 4537, 2493, 3009, 3017, 4044, 974, 3026, 2515, 4562, 3549, 1502, 2031, 4085, 2553, 509}
Number of common outliers for 'numbercustomerservicecalls': 145


2.6 Bivariate Analysis¶

To test variable redundacy, correlation analysis was performed by groups. We decided to group variables in the scope of time periods and type of comunication service. Namely:

1- 'totaldayminutes', 'totaldaycalls', 'totaldaycharge';

2- 'totaleveminutes', 'totalevecalls', 'totalevecharge';

3- 'totalnightminutes', 'totalnightcalls', 'totalnightcharge';

4- 'totalintlminutes', 'totalintlcalls', 'totalintlcharge'.

In [17]:
data[['totaldayminutes', 'totaldaycalls',	'totaldaycharge']].corr()
Out[17]:
totaldayminutes totaldaycalls totaldaycharge
totaldayminutes 1.000000 0.003293 0.992380
totaldaycalls 0.003293 1.000000 -0.000558
totaldaycharge 0.992380 -0.000558 1.000000
In [18]:
data[['totaleveminutes', 'totalevecalls',	'totalevecharge']].corr()
Out[18]:
totaleveminutes totalevecalls totalevecharge
totaleveminutes 1.00000 0.004650 1.000000
totalevecalls 0.00465 1.000000 0.004467
totalevecharge 1.00000 0.004467 1.000000
In [19]:
data[['totalnightminutes', 'totalnightcalls',	'totalnightcharge']].corr()
Out[19]:
totalnightminutes totalnightcalls totalnightcharge
totalnightminutes 1.000000 0.024878 0.999999
totalnightcalls 0.024878 1.000000 0.028675
totalnightcharge 0.999999 0.028675 1.000000
In [20]:
data[['totalintlminutes','totalintlcalls', 'totalintlcharge']].corr()
Out[20]:
totalintlminutes totalintlcalls totalintlcharge
totalintlminutes 1.000000 0.018186 0.994521
totalintlcalls 0.018186 1.000000 0.022057
totalintlcharge 0.994521 0.022057 1.000000
In [21]:
fig, axs = plt.subplots(1, 4, figsize=(17, 5))

data.plot.scatter('totaldayminutes', 'totaldaycharge', ax=axs[0])
axs[0].set_title('Total day minutes vs Total day charge')

data.plot.scatter('totaleveminutes', 'totalevecharge', ax=axs[1])
axs[1].set_title('Total evening minutes vs Total evening charge')

data.plot.scatter('totalnightminutes', 'totalnightcharge', ax=axs[2])
axs[2].set_title('Total night minutes vs Total night charge')

data.plot.scatter('totalintlminutes', 'totalintlcharge', ax=axs[3])
axs[3].set_title('Total international minutes vs Total international charge')

plt.tight_layout()
plt.show()
No description has been provided for this image

A high correlation was observed between the variables 'total count of minutes' and 'total charge' within each group, suggesting a strong linear relationship between these variables.

These results indicate that one of these variables might be redundant and could potentially be removed from the dataset without significantly affecting the information conveyed by the variables.

3 - Data Preparation¶

The data set features "Churn", "Internationplan" and "Voicemailplan", which presented categorical values, ware converted into a numeric type so that it could be provided to models that only work with this type of data. Therefore, observations with the value "No" assumed the value 0 and the ones with "Yes" the value 1.

In [3]:
columns_to_encode = ['churn', 'internationalplan', 'voicemailplan']

label_encoder = LabelEncoder()

for col in columns_to_encode:
    data[col] = label_encoder.fit_transform(data[col])

data
Out[3]:
churn accountlength internationalplan voicemailplan numbervmailmessages totaldayminutes totaldaycalls totaldaycharge totaleveminutes totalevecalls totalevecharge totalnightminutes totalnightcalls totalnightcharge totalintlminutes totalintlcalls totalintlcharge numbercustomerservicecalls
0 0 128.0 0 1 25.0 265.1 110.0 45.07 197.4 99.0 16.78 244.7 91.0 11.01 10.0 3.0 2.70 1.0
1 0 107.0 0 1 26.0 161.6 123.0 27.47 195.5 103.0 16.62 254.4 103.0 11.45 13.7 3.0 3.70 1.0
2 0 137.0 0 0 0.0 243.4 NaN 41.38 121.2 110.0 10.30 162.6 104.0 7.32 12.2 5.0 3.29 0.0
3 0 84.0 1 0 0.0 299.4 71.0 50.90 61.9 88.0 5.26 196.9 89.0 8.86 6.6 7.0 1.78 2.0
4 0 75.0 1 0 0.0 166.7 113.0 28.34 148.3 122.0 12.61 186.9 121.0 8.41 10.1 3.0 2.73 3.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4995 0 50.0 0 1 40.0 235.7 127.0 40.07 223.0 126.0 18.96 297.5 116.0 13.39 9.9 5.0 2.67 2.0
4996 1 152.0 0 0 0.0 184.2 90.0 31.31 256.8 73.0 21.83 213.6 113.0 9.61 14.7 2.0 3.97 3.0
4997 0 61.0 0 0 0.0 140.6 89.0 23.90 172.8 128.0 14.69 212.4 97.0 9.56 13.6 4.0 3.67 1.0
4998 0 109.0 0 0 0.0 188.8 67.0 32.10 171.7 92.0 14.59 224.4 89.0 NaN 8.5 6.0 2.30 0.0
4999 0 86.0 0 1 34.0 129.4 102.0 22.00 267.1 104.0 22.70 154.8 100.0 6.97 9.3 16.0 2.51 0.0

5000 rows × 18 columns

3.2 - Data Cleaning: missing values¶

Handling missing values in a dataset is crucial to ensure the integrity of the analysis.

Row removal¶

In [23]:
for row in data.itertuples():
    if data.isnull().sum(axis=1)[row[0]] >= 2:
        print(f"Row {row[0]} has {data.isnull().sum(axis=1)[row[0]]} NaN values")
Row 233 has 2 NaN values
Row 376 has 2 NaN values
Row 485 has 2 NaN values
Row 517 has 2 NaN values
Row 546 has 2 NaN values
Row 592 has 2 NaN values
Row 594 has 2 NaN values
Row 717 has 2 NaN values
Row 748 has 2 NaN values
Row 1082 has 2 NaN values
Row 1091 has 2 NaN values
Row 1178 has 2 NaN values
Row 1351 has 2 NaN values
Row 1437 has 2 NaN values
Row 1737 has 2 NaN values
Row 1839 has 2 NaN values
Row 1984 has 2 NaN values
Row 2095 has 2 NaN values
Row 2107 has 2 NaN values
Row 2246 has 2 NaN values
Row 2324 has 2 NaN values
Row 2524 has 2 NaN values
Row 2754 has 2 NaN values
Row 2762 has 2 NaN values
Row 2944 has 2 NaN values
Row 2947 has 2 NaN values
Row 3154 has 2 NaN values
Row 3262 has 2 NaN values
Row 3633 has 2 NaN values
Row 3686 has 3 NaN values
Row 3752 has 2 NaN values
Row 3791 has 3 NaN values
Row 3937 has 2 NaN values
Row 3967 has 2 NaN values
Row 4004 has 2 NaN values
Row 4071 has 2 NaN values
Row 4081 has 2 NaN values
Row 4147 has 2 NaN values
Row 4215 has 2 NaN values
Row 4396 has 2 NaN values
Row 4612 has 2 NaN values
Row 4727 has 2 NaN values
Row 4731 has 2 NaN values

As seen above, the rows that have more than one missing value only contain maximum of three missing values. Since there are 17 variables (other than our target variable, churn), we consider that this number is not significantly high for any of these rows to be removed.

Imputation¶

One strategy to address missing data involves imputation, where missing values are filled in.

In this dataset we applied the following Imputation methods:

1 - Mean Imputation, where missing values are replaced by the mean of the observed values for a certain variable. It is a simple and quick method but may not capture the true distribution if missing values are significant or occur in a pattern.

2 - Median Imputation, where missing values are replaced by the median of the observed values. Useful for handling outliers as it's less sensitive to extreme values compared to mean imputation, being a good alternative for skewed distributions or datasets with outliers.

3 - Linear Imputation, that uses a linear regression model to predict missing values based on other observed values. Effective when relationships between variables are linear and missing values are related to other observed variables. It captures relationships between features to estimate missing values.

4 - KNN Imputation, that uses the values of the k-nearest neighbors (nearest data points) to estimate missing values. It finds similar data points based on other features and uses their values to impute the missing value. Effective for datasets where the missingness has a complex pattern or where features have non-linear relationships. KNN imputation considers multiple features and their relationships to estimate missing values.

In [24]:
variables_to_fill = ['totaldaycalls', 'totalevecalls', 'totalnightcalls','totalintlcalls']

data_mean_imputed = data.copy()
data_median_imputed = data.copy()

for col in variables_to_fill:
    data_mean_imputed[col].fillna(data[col].mean(), inplace=True)
    data_median_imputed[col].fillna(data[col].median(), inplace=True)

data_interpolated = data.copy()
data_interpolated[variables_to_fill] = data_interpolated[variables_to_fill].interpolate(method='linear')

imputer = KNNImputer(n_neighbors=5)
data_knn_imputed = data.copy()
data_knn_imputed[variables_to_fill] = imputer.fit_transform(data_knn_imputed[variables_to_fill])

mean_imputed_stats = data_mean_imputed[variables_to_fill].describe()
median_imputed_stats = data_median_imputed[variables_to_fill].describe()
interpolated_stats = data_interpolated[variables_to_fill].describe()
knn_imputed_stats = data_knn_imputed[variables_to_fill].describe()

without_imputation = data[['totaldaycalls', 'totalevecalls','totalnightcalls','totalintlcalls']].describe()
print("Statistics without Imputation:")
print(without_imputation)

print("\nStatistics for Mean Imputation:")
print(mean_imputed_stats)

print("\nStatistics for Median Imputation:")
print(median_imputed_stats)

print("\nStatistics for Linear Interpolation:")
print(interpolated_stats)

print("\nStatistics for KNN Imputation:")
print(knn_imputed_stats)
Statistics without Imputation:
       totaldaycalls  totalevecalls  totalnightcalls  totalintlcalls
count    4950.000000    4950.000000      4950.000000     4950.000000
mean      100.038788     100.243838        99.932929        4.432525
std        19.844529      19.837380        19.939450        2.448457
min         0.000000       0.000000         0.000000        0.000000
25%        87.000000      87.000000        87.000000        3.000000
50%       100.000000     101.000000       100.000000        4.000000
75%       113.000000     114.000000       113.000000        6.000000
max       165.000000     170.000000       175.000000       20.000000

Statistics for Mean Imputation:
       totaldaycalls  totalevecalls  totalnightcalls  totalintlcalls
count    5000.000000    5000.000000      5000.000000     5000.000000
mean      100.038788     100.243838        99.932929        4.432525
std        19.745037      19.737924        19.839483        2.436181
min         0.000000       0.000000         0.000000        0.000000
25%        87.000000      87.000000        87.000000        3.000000
50%       100.000000     100.243838       100.000000        4.000000
75%       113.000000     113.000000       113.000000        6.000000
max       165.000000     170.000000       175.000000       20.000000

Statistics for Median Imputation:
       totaldaycalls  totalevecalls  totalnightcalls  totalintlcalls
count    5000.000000    5000.000000      5000.000000     5000.000000
mean      100.038400     100.251400        99.933600        4.428200
std        19.745037      19.738068        19.839484        2.436562
min         0.000000       0.000000         0.000000        0.000000
25%        87.000000      87.000000        87.000000        3.000000
50%       100.000000     101.000000       100.000000        4.000000
75%       113.000000     113.000000       113.000000        6.000000
max       165.000000     170.000000       175.000000       20.000000

Statistics for Linear Interpolation:
       totaldaycalls  totalevecalls  totalnightcalls  totalintlcalls
count    5000.000000    5000.000000      5000.000000     5000.000000
mean      100.021300     100.240400        99.940700        4.433500
std        19.797851      19.774226        19.904891        2.443263
min         0.000000       0.000000         0.000000        0.000000
25%        87.000000      87.000000        87.000000        3.000000
50%       100.000000     100.500000       100.000000        4.000000
75%       113.000000     114.000000       113.000000        6.000000
max       165.000000     170.000000       175.000000       20.000000

Statistics for KNN Imputation:
       totaldaycalls  totalevecalls  totalnightcalls  totalintlcalls
count     5000.00000    5000.000000      5000.000000      5000.00000
mean       100.02932     100.246000        99.939800         4.43208
std         19.76802      19.759962        19.855006         2.43898
min          0.00000       0.000000         0.000000         0.00000
25%         87.00000      87.000000        87.000000         3.00000
50%        100.00000     100.000000       100.000000         4.00000
75%        113.00000     113.050000       113.000000         6.00000
max        165.00000     170.000000       175.000000        20.00000
In [25]:
for col in variables_to_fill:
    plt.figure(figsize=(14, 4))

    plt.subplot(1, 5, 1)
    sns.histplot(data[col], bins=20, kde=True, color='purple', label='Without Imputation')
    plt.legend()
    
    plt.subplot(1, 5, 2)
    sns.histplot(data_mean_imputed[col], bins=20, kde=True, color='blue', label='Mean Imputation')
    plt.legend()
    
    plt.subplot(1, 5, 3)
    sns.histplot(data_median_imputed[col], bins=20, kde=True, color='orange', label='Median Imputation')
    plt.legend()

    plt.subplot(1, 5, 4)
    sns.histplot(data_interpolated[col], bins=20, kde=True, color='green', label='Linear Interpolation')
    plt.legend()

    plt.subplot(1, 5, 5)
    sns.histplot(data_knn_imputed[col], bins=20, kde=True, color='red', label='KNN Imputation')
    plt.legend()

    plt.tight_layout()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

From the analysis of the obtained statistics and visually inspecting the above histograms, the minor variations observed among the dataset before and after employing different imputation methods didn't significantly alter the statistical properties of the data.

This indicates that the various imputation methods preserved the dataset's integrity by estimating missing values in ways that align with the data's inherent statistical characteristics.

Considering that the KNN imputation method can handle both continuous and categorical variables and no assumptions about the distribution of data is required, we decided to use this method to fill the missing values in our dataset.

KNN Imputation¶

In [4]:
imputer = KNNImputer(n_neighbors=5)
data_imputed = imputer.fit_transform(data)

data = pd.DataFrame(data_imputed, columns=data.columns)

imputed_stats = data.describe()
print("\nStatistics for KNN Imputation:")
print(imputed_stats)
Statistics for KNN Imputation:
             churn  accountlength  internationalplan  voicemailplan  \
count  5000.000000    5000.000000        5000.000000    5000.000000   
mean      0.141400     100.226784           0.113800       0.282400   
std       0.348469      39.554740           0.347669       0.471905   
min       0.000000       1.000000           0.000000       0.000000   
25%       0.000000      73.000000           0.000000       0.000000   
50%       0.000000     100.000000           0.000000       0.000000   
75%       0.000000     127.000000           0.000000       1.000000   
max       1.000000     243.000000           2.000000       2.000000   

       numbervmailmessages  totaldayminutes  totaldaycalls  totaldaycharge  \
count          5000.000000      5000.000000    5000.000000     5000.000000   
mean              7.760680       180.303714     100.035360       30.638822   
std              13.499377        53.725615      19.766022        9.148733   
min               0.000000         0.000000       0.000000        0.000000   
25%               0.000000       143.900000      87.000000       24.430000   
50%               0.000000       180.050000     100.000000       30.600000   
75%              16.000000       216.000000     113.000000       36.750000   
max              52.000000       351.500000     165.000000       59.760000   

       totaleveminutes  totalevecalls  totalevecharge  totalnightminutes  \
count      5000.000000    5000.000000     5000.000000        5000.000000   
mean        200.634184     100.250200       17.054007         200.468532   
std          50.291279      19.759087        4.293597          50.276816   
min           0.000000       0.000000        0.000000           0.000000   
25%         166.800000      87.000000       14.140000         167.300000   
50%         201.000000     101.000000       17.090000         200.500000   
75%         233.800000     113.000000       19.900000         234.125000   
max         363.700000     170.000000       30.910000         395.000000   

       totalnightcalls  totalnightcharge  totalintlminutes  totalintlcalls  \
count       5000.00000       5000.000000       5000.000000      5000.00000   
mean          99.92300          9.017016         10.262268         4.42888   
std           19.85791          2.271356          2.752844         2.43862   
min            0.00000          0.000000          0.000000         0.00000   
25%           87.00000          7.510000          8.500000         3.00000   
50%          100.00000          9.020000         10.300000         4.00000   
75%          113.00000         10.557000         12.000000         6.00000   
max          175.00000         17.770000         20.000000        20.00000   

       totalintlcharge  numbercustomerservicecalls  
count      5000.000000                  5000.00000  
mean          2.771706                     1.56956  
std           0.741653                     1.29969  
min           0.000000                     0.00000  
25%           2.309000                     1.00000  
50%           2.780000                     1.00000  
75%           3.240000                     2.00000  
max           5.400000                     9.00000  

3.3 - Data Cleaning: outliers¶

We performed, in a copy of the dataset, removal of the common outliers using the IQR, Z-score and MAD methods.

In [27]:
data_out_rem = data.copy()

variables = ['accountlength', 'totaldayminutes', 'totaldaycalls', 'totaldaycharge',
             'totaleveminutes', 'totalevecalls', 'totalevecharge', 'totalnightminutes',
             'totalnightcalls', 'totalnightcharge', 'totalintlminutes', 'totalintlcharge']

def detect_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data < lower_bound) | (data > upper_bound)]

def detect_outliers_mad(data, threshold=3):
    median = data.median()
    abs_deviations = np.abs(data - median)
    mad = abs_deviations.median()
    return data[abs_deviations > threshold * mad]

outliers_iqr = {}
outliers_mad = {}
outliers_z = {}

for var in variables:
    outliers_iqr[var] = set(detect_outliers_iqr(data[var]).index)
    outliers_mad[var] = set(detect_outliers_mad(data[var]).index)
    
    mean = data[var].mean()
    std_dev = data[var].std()
    z_scores = (data[var] - mean) / std_dev
    outliers_z[var] = set(data[abs(z_scores) > 3].index)

common_outliers = {}
for var in variables:
    common_outliers[var] = outliers_iqr[var].intersection(outliers_mad[var], outliers_z[var])

common_indices = set.intersection(*common_outliers.values())
data_out_rem = data_out_rem.drop(common_indices)

for var in variables:
    common_var_outliers = common_outliers[var].intersection(data_out_rem.index)
    print(f"Number of common outliers removed for '{var}': {len(common_var_outliers)}")

stats_after = {}
for var in variables:
    stats_after[var] = data_out_rem[~data_out_rem.index.isin(common_outliers[var])][var].describe()

print("\nStatistics After Removing Outliers:")
for var, stats in stats_after.items():
    print(f"\nStatistics for '{var}':")
    print(stats)
Number of common outliers removed for 'accountlength': 12
Number of common outliers removed for 'totaldayminutes': 14
Number of common outliers removed for 'totaldaycalls': 14
Number of common outliers removed for 'totaldaycharge': 13
Number of common outliers removed for 'totaleveminutes': 17
Number of common outliers removed for 'totalevecalls': 9
Number of common outliers removed for 'totalevecharge': 15
Number of common outliers removed for 'totalnightminutes': 18
Number of common outliers removed for 'totalnightcalls': 15
Number of common outliers removed for 'totalnightcharge': 18
Number of common outliers removed for 'totalintlminutes': 39
Number of common outliers removed for 'totalintlcharge': 39

Statistics After Removing Outliers:

Statistics for 'accountlength':
count    4988.000000
mean       99.918388
std        39.097257
min         1.000000
25%        73.000000
50%       100.000000
75%       126.000000
max       217.000000
Name: accountlength, dtype: float64

Statistics for 'totaldayminutes':
count    4986.000000
mean      180.514054
std        53.030303
min        19.500000
25%       144.100000
50%       180.100000
75%       216.000000
max       338.400000
Name: totaldayminutes, dtype: float64

Statistics for 'totaldaycalls':
count    4986.000000
mean      100.127517
std        19.447740
min        42.000000
25%        87.000000
50%       100.000000
75%       113.000000
max       158.000000
Name: totaldaycalls, dtype: float64

Statistics for 'totaldaycharge':
count    4987.000000
mean       30.669035
std         9.038280
min         3.210000
25%        24.450000
50%        30.600000
75%        36.750000
max        57.530000
Name: totaldaycharge, dtype: float64

Statistics for 'totaleveminutes':
count    4983.000000
mean      200.806125
std        49.487535
min        52.900000
25%       167.000000
50%       201.000000
75%       233.800000
max       350.900000
Name: totaleveminutes, dtype: float64

Statistics for 'totalevecalls':
count    4991.000000
mean      100.271889
std        19.531729
min        42.000000
25%        87.000000
50%       101.000000
75%       113.000000
max       159.000000
Name: totalevecalls, dtype: float64

Statistics for 'totalevecharge':
count    4985.000000
mean       17.073894
std         4.232246
min         4.180000
25%        14.170000
50%        17.090000
75%        19.900000
max        29.930000
Name: totalevecharge, dtype: float64

Statistics for 'totalnightminutes':
count    4982.000000
mean      200.266792
std        49.353499
min        50.100000
25%       167.325000
50%       200.400000
75%       233.700000
max       350.200000
Name: totalnightminutes, dtype: float64

Statistics for 'totalnightcalls':
count    4985.000000
mean       99.917553
std        19.517249
min        41.000000
25%        87.000000
50%       100.000000
75%       113.000000
max       159.000000
Name: totalnightcalls, dtype: float64

Statistics for 'totalnightcharge':
count    4982.000000
mean        9.007922
std         2.229986
min         2.250000
25%         7.520000
50%         9.010000
75%        10.537500
max        15.760000
Name: totalnightcharge, dtype: float64

Statistics for 'totalintlminutes':
count    4961.000000
mean       10.309825
std         2.623528
min         2.100000
25%         8.600000
50%        10.400000
75%        12.000000
max        18.500000
Name: totalintlminutes, dtype: float64

Statistics for 'totalintlcharge':
count    4961.000000
mean        2.783654
std         0.706657
min         0.570000
25%         2.320000
50%         2.810000
75%         3.240000
max         4.970000
Name: totalintlcharge, dtype: float64
In [28]:
data_out_rem = data.copy()

variables = ['numbervmailmessages', 'totalintlcalls', 'numbercustomerservicecalls']

def detect_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data < lower_bound) | (data > upper_bound)]

def detect_outliers_mad(data, threshold=4):
    median = data.median()
    abs_deviations = np.abs(data - median)
    mad = abs_deviations.median()
    return data[abs_deviations > threshold * mad]

outliers_iqr = {}
outliers_mad = {}

for var in variables:
    outliers_iqr[var] = set(detect_outliers_iqr(data[var]).index)
    outliers_mad[var] = set(detect_outliers_mad(data[var]).index)
    
common_outliers = {}
for var in variables:
    common_outliers[var] = outliers_iqr[var].intersection(outliers_mad[var])

common_indices = set.intersection(*common_outliers.values())
data_out_rem = data_out_rem.drop(common_indices)

for var in variables:
    common_var_outliers = common_outliers[var].intersection(data_out_rem.index)
    print(f"Number of common outliers removed for '{var}': {len(common_var_outliers)}")

stats_after = {}
for var in variables:
    stats_after[var] = data_out_rem[~data_out_rem.index.isin(common_outliers[var])][var].describe()

print("\nStatistics After Removing Outliers:")
for var, stats in stats_after.items():
    print(f"\nStatistics for '{var}':")
    print(stats)
Number of common outliers removed for 'numbervmailmessages': 99
Number of common outliers removed for 'totalintlcalls': 114
Number of common outliers removed for 'numbercustomerservicecalls': 51

Statistics After Removing Outliers:

Statistics for 'numbervmailmessages':
count    4901.000000
mean        7.031504
std        12.606129
min         0.000000
25%         0.000000
50%         0.000000
75%         8.800000
max        40.000000
Name: numbervmailmessages, dtype: float64

Statistics for 'totalintlcalls':
count    4886.000000
mean        4.232788
std         2.070127
min         0.000000
25%         3.000000
50%         4.000000
75%         5.000000
max        10.000000
Name: totalintlcalls, dtype: float64

Statistics for 'numbercustomerservicecalls':
count    4949.000000
mean        1.519256
std         1.205275
min         0.000000
25%         1.000000
50%         1.000000
75%         2.000000
max         5.000000
Name: numbercustomerservicecalls, dtype: float64

Given the high number of rows that would be discarded, we decided to not remove outliers from the original dataset to not compromise the integrity of the dataset.

In [29]:
data.describe()
Out[29]:
churn accountlength internationalplan voicemailplan numbervmailmessages totaldayminutes totaldaycalls totaldaycharge totaleveminutes totalevecalls totalevecharge totalnightminutes totalnightcalls totalnightcharge totalintlminutes totalintlcalls totalintlcharge numbercustomerservicecalls
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.00000 5000.000000 5000.000000 5000.00000 5000.000000 5000.00000
mean 0.141400 100.226784 0.113800 0.282400 7.760680 180.303714 100.035360 30.638822 200.634184 100.250200 17.054007 200.468532 99.92300 9.017016 10.262268 4.42888 2.771706 1.56956
std 0.348469 39.554740 0.347669 0.471905 13.499377 53.725615 19.766022 9.148733 50.291279 19.759087 4.293597 50.276816 19.85791 2.271356 2.752844 2.43862 0.741653 1.29969
min 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.00000 0.000000 0.00000
25% 0.000000 73.000000 0.000000 0.000000 0.000000 143.900000 87.000000 24.430000 166.800000 87.000000 14.140000 167.300000 87.00000 7.510000 8.500000 3.00000 2.309000 1.00000
50% 0.000000 100.000000 0.000000 0.000000 0.000000 180.050000 100.000000 30.600000 201.000000 101.000000 17.090000 200.500000 100.00000 9.020000 10.300000 4.00000 2.780000 1.00000
75% 0.000000 127.000000 0.000000 1.000000 16.000000 216.000000 113.000000 36.750000 233.800000 113.000000 19.900000 234.125000 113.00000 10.557000 12.000000 6.00000 3.240000 2.00000
max 1.000000 243.000000 2.000000 2.000000 52.000000 351.500000 165.000000 59.760000 363.700000 170.000000 30.910000 395.000000 175.00000 17.770000 20.000000 20.00000 5.400000 9.00000

3.3 - Data Reduction: Feature Selection¶

While there are seventeen features available to predict the target, analyzing all the data objects can be a time-consuming and expensive process. It is also important to ensure that the analysis is focused on the most relevant information and to measure the impact of considering different sets of features.

Our approach included:

1) Identifying Redundant Features through a Correlation Matrix. Correlation matrix is a commonly used method in which the correlation for each pair of numerical variables is computed. With this, it is possible to filter out variables with low correlation to the target and remove redundant variables.

2) Applying Ensemble Methods: We used the Boruta Method to feature ranking from a random forest approach. Boruta algorithm is a wrapper built around the random forest classification and it gives a numerical estimate of the feature importance.

3) Recursive Feature Elimination (RFE): This wrapper method uses a recursive algorithm to iteratively remove the least significant attributes until a final set of the most important variables is ahieved.

Note: We had consider to use Principal Components Analysis (PCA), but ultimately decided not because it is primarily designed for numerical features and may not be directly applicable to categorical or discrete features. PCA works by identifying linear combinations of features that capture the maximum variance in the data and while it is technically possible to use PCA on categorical variables (by hot encoding them), that is not an optimal approach because the encoding process introduces artificial relationships between categories that may not reflect the true underlying structure of the data.

In [30]:
correlation_matrix = data.corr(numeric_only = True)
correlation_matrix
Out[30]:
churn accountlength internationalplan voicemailplan numbervmailmessages totaldayminutes totaldaycalls totaldaycharge totaleveminutes totalevecalls totalevecharge totalnightminutes totalnightcalls totalnightcharge totalintlminutes totalintlcalls totalintlcharge numbercustomerservicecalls
churn 1.000000 0.017599 0.223803 -0.096899 -0.098410 0.204437 0.015782 0.205893 0.089790 -0.007707 0.089355 0.043279 -0.007284 0.045688 0.061712 -0.044919 0.059816 0.216391
accountlength 0.017599 1.000000 0.004607 -0.020104 -0.017260 -0.001634 0.026951 -0.000937 -0.007911 0.011359 -0.008081 0.003669 -0.014938 0.003219 0.000348 0.013088 -0.000795 -0.001303
internationalplan 0.223803 0.004607 1.000000 -0.002055 0.006188 0.022611 0.000579 0.023479 0.032190 -0.018647 0.032162 -0.007305 0.016493 -0.009109 0.022755 0.002776 0.021917 -0.015707
voicemailplan -0.096899 -0.020104 -0.002055 1.000000 0.876245 -0.000694 0.006088 -0.000403 0.030899 -0.003010 0.030819 0.004525 0.010740 0.004073 -0.002994 -0.000448 -0.003718 -0.005681
numbervmailmessages -0.098410 -0.017260 0.006188 0.876245 1.000000 0.007037 0.003585 0.007447 0.020998 -0.007010 0.021033 0.005377 0.000477 0.005103 0.005578 -0.004427 0.003519 -0.009512
totaldayminutes 0.204437 -0.001634 0.022611 -0.000694 0.007037 1.000000 0.002713 0.989143 -0.010498 0.008244 -0.011064 0.009606 0.004540 0.009887 -0.021864 0.002713 -0.017980 -0.004387
totaldaycalls 0.015782 0.026951 0.000579 0.006088 0.003585 0.002713 1.000000 0.000978 -0.000310 0.002941 0.000457 0.005971 -0.005723 0.004675 0.015280 0.011587 0.012926 -0.008329
totaldaycharge 0.205893 -0.000937 0.023479 -0.000403 0.007447 0.989143 0.000978 1.000000 -0.008848 0.008896 -0.009529 0.011760 0.005581 0.012025 -0.022457 0.000308 -0.018685 -0.000983
totaleveminutes 0.089790 -0.007911 0.032190 0.030899 0.020998 -0.010498 -0.000310 -0.008848 1.000000 0.004539 0.993895 -0.017636 0.013641 -0.017142 0.000512 0.010567 -0.005727 -0.012021
totalevecalls -0.007707 0.011359 -0.018647 -0.003010 -0.007010 0.008244 0.002941 0.008896 0.004539 1.000000 0.003882 0.003286 -0.013501 0.002488 -0.006834 0.001308 -0.005892 0.005579
totalevecharge 0.089355 -0.008081 0.032162 0.030819 0.021033 -0.011064 0.000457 -0.009529 0.993895 0.003882 1.000000 -0.017165 0.011710 -0.016676 0.001482 0.010663 -0.004732 -0.012234
totalnightminutes 0.043279 0.003669 -0.007305 0.004525 0.005377 0.009606 0.005971 0.011760 -0.017636 0.003286 -0.017165 1.000000 0.025320 0.993036 -0.006254 -0.019283 -0.005600 -0.006955
totalnightcalls -0.007284 -0.014938 0.016493 0.010740 0.000477 0.004540 -0.005723 0.005581 0.013641 -0.013501 0.011710 0.025320 1.000000 0.027563 0.000101 -0.001553 0.002806 -0.006382
totalnightcharge 0.045688 0.003219 -0.009109 0.004073 0.005103 0.009887 0.004675 0.012025 -0.017142 0.002488 -0.016676 0.993036 0.027563 1.000000 -0.007700 -0.018429 -0.007068 -0.008293
totalintlminutes 0.061712 0.000348 0.022755 -0.002994 0.005578 -0.021864 0.015280 -0.022457 0.000512 -0.006834 0.001482 -0.006254 0.000101 -0.007700 1.000000 0.018314 0.984733 -0.011243
totalintlcalls -0.044919 0.013088 0.002776 -0.000448 -0.004427 0.002713 0.011587 0.000308 0.010567 0.001308 0.010663 -0.019283 -0.001553 -0.018429 0.018314 1.000000 0.021163 -0.018942
totalintlcharge 0.059816 -0.000795 0.021917 -0.003718 0.003519 -0.017980 0.012926 -0.018685 -0.005727 -0.005892 -0.004732 -0.005600 0.002806 -0.007068 0.984733 0.021163 1.000000 -0.011175
numbercustomerservicecalls 0.216391 -0.001303 -0.015707 -0.005681 -0.009512 -0.004387 -0.008329 -0.000983 -0.012021 0.005579 -0.012234 -0.006955 -0.006382 -0.008293 -0.011243 -0.018942 -0.011175 1.000000
In [31]:
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.show()
No description has been provided for this image

From the Correlation Matrix and the Bivariate analysis performed, we detected a few highly correlated variables that may be redundant to the problem. This was the case of the 'total count of minutes' and 'total charge' variables within each group, which corroborates our own business perception of the problem, since normally, the charge value of a call is calculated directly from the minute’s duration of that call.

With this in mind, removing the variables: totaldayminutes, totaleveminutes, totalnightcharge, totalintminutes could be a valid approach since between each pair they were the ones less correlated with the label feature and removing them would allow reduce the dimensionality of the dataset without compromising drastically the predictive performance of the future model.

Boruta Method¶

In [32]:
np.int = np.int32
np.float = np.float64
np.bool = np.bool_
In [33]:
X = data.drop('churn', axis=1)
y = data['churn']

forest = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5, random_state=42)

boruta_selector = BorutaPy(forest, n_estimators='auto', verbose=2, random_state=42)
boruta_selector.fit(X.values, y.values)
Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	17
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	17
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	17
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	17
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	17
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	17
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	17
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	13
Tentative: 	0
Rejected: 	4


BorutaPy finished running.

Iteration: 	9 / 100
Confirmed: 	13
Tentative: 	0
Rejected: 	4
Out[33]:
BorutaPy(estimator=RandomForestClassifier(class_weight='balanced', max_depth=5,
                                          n_estimators=116, n_jobs=-1,
                                          random_state=RandomState(MT19937) at 0x2A81CE80840),
         n_estimators='auto',
         random_state=RandomState(MT19937) at 0x2A81CE80840, verbose=2)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
BorutaPy(estimator=RandomForestClassifier(class_weight='balanced', max_depth=5,
                                          n_estimators=116, n_jobs=-1,
                                          random_state=RandomState(MT19937) at 0x2A81CE80840),
         n_estimators='auto',
         random_state=RandomState(MT19937) at 0x2A81CE80840, verbose=2)
RandomForestClassifier(class_weight='balanced', max_depth=5, n_estimators=116,
                       n_jobs=-1,
                       random_state=RandomState(MT19937) at 0x2A81CE80840)
RandomForestClassifier(class_weight='balanced', max_depth=5, n_estimators=116,
                       n_jobs=-1,
                       random_state=RandomState(MT19937) at 0x2A81CE80840)
In [34]:
selected_features= X.columns[boruta_selector.support_].tolist()
removed_features= X.columns[~boruta_selector.support_].tolist()

print("Selected Features:", selected_features)
print("Removed Features:", removed_features)
Selected Features: ['internationalplan', 'voicemailplan', 'numbervmailmessages', 'totaldayminutes', 'totaldaycharge', 'totaleveminutes', 'totalevecharge', 'totalnightminutes', 'totalnightcharge', 'totalintlminutes', 'totalintlcalls', 'totalintlcharge', 'numbercustomerservicecalls']
Removed Features: ['accountlength', 'totaldaycalls', 'totalevecalls', 'totalnightcalls']
In [74]:
forest.fit(X.values, y.values)

feature_importances = forest.feature_importances_

feature_names = X.columns

sorted_idx = np.argsort(feature_importances)[::-1]

plt.figure(figsize=(12, 6))
plt.bar(range(X.shape[1]), feature_importances[sorted_idx], align="center")
plt.xticks(range(X.shape[1]), np.array(feature_names)[sorted_idx], rotation=90)
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.title("Feature Importances after Boruta")
plt.show()
No description has been provided for this image

Recursive Feature Elimination¶

First we trainned our model with different values of "n_features_to_select" and evaluate its performance using a validation set or cross-validation.

In [75]:
feature_range = range(9, len(X.columns) + 1,2)

cv_scores = []

for n in feature_range:
    rfe_selector = RFE(estimator=forest, n_features_to_select=n, step=1)
    scores = cross_val_score(rfe_selector, X, y, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

plt.plot(feature_range, cv_scores)
plt.xlabel('Number of Features Selected')
plt.ylabel('Cross-Validation Accuracy')
plt.title('RFE Performance')
plt.show()
No description has been provided for this image

Looking for an "elbow" point in the plot where the performance gain starts to diminish, we choose the value of features to be selected as 11.

In [37]:
n_features_to_select = min(11, X.shape[1])

rfe_selector = RFE(estimator=forest, n_features_to_select=n_features_to_select, step=1)
rfe_selector.fit(X, y)

final_selected_features = X.columns[rfe_selector.support_]
removed_features = X.columns[~rfe_selector.support_]

feature_importances = rfe_selector.estimator_.feature_importances_
sorted_idx = np.argsort(feature_importances)[::-1]

plt.figure(figsize=(12, 6))
plt.bar(range(len(final_selected_features)), feature_importances[sorted_idx], align="center")
plt.xticks(range(len(final_selected_features)), np.array(final_selected_features)[sorted_idx], rotation=90)
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.title("Feature Importances after RFE")
plt.show()

print("Final Selected Features:", final_selected_features)
print("Removed Features:", removed_features)
No description has been provided for this image
Final Selected Features: Index(['internationalplan', 'voicemailplan', 'numbervmailmessages',
       'totaldayminutes', 'totaldaycharge', 'totaleveminutes',
       'totalevecharge', 'totalintlminutes', 'totalintlcalls',
       'totalintlcharge', 'numbercustomerservicecalls'],
      dtype='object')
Removed Features: Index(['accountlength', 'totaldaycalls', 'totalevecalls', 'totalnightminutes',
       'totalnightcalls', 'totalnightcharge'],
      dtype='object')

It's interesting to observe that both Boruta and Recursive Feature Elimination (RFE) independently interced the same variables for removal (the 4 variables to be excluded in the boruta method are also present in the removed features of the RFE method). This could indicate that these particular features might indeed have lower importance or contribution to the predictive performance of the model.

A new dataset without these 4 variables might be a reasonable approach.

3.4 - Construct Data: Discretising¶

Discretizing data converts continuous data into discrete or categorical data, which makes data more manageable for machine learning algorithms. It reduces data dimensionality, improves data quality and enhances model performance, by making the data more compatible with the algorithms and reducing overfitting.

In [5]:
accountlength_array = data['accountlength'].to_numpy()
pd.cut(accountlength_array, 10)
Out[5]:
[(122.0, 146.2], (97.8, 122.0], (122.0, 146.2], (73.6, 97.8], (73.6, 97.8], ..., (49.4, 73.6], (146.2, 170.4], (49.4, 73.6], (97.8, 122.0], (73.6, 97.8]]
Length: 5000
Categories (10, interval[float64, right]): [(0.758, 25.2] < (25.2, 49.4] < (49.4, 73.6] < (73.6, 97.8] ... (146.2, 170.4] < (170.4, 194.6] < (194.6, 218.8] < (218.8, 243.0]]
In [6]:
totaldayminutes_array = data['totaldayminutes'].to_numpy()
pd.cut(totaldayminutes_array, 10)
Out[6]:
[(246.05, 281.2], (140.6, 175.75], (210.9, 246.05], (281.2, 316.35], (140.6, 175.75], ..., (210.9, 246.05], (175.75, 210.9], (105.45, 140.6], (175.75, 210.9], (105.45, 140.6]]
Length: 5000
Categories (10, interval[float64, right]): [(-0.352, 35.15] < (35.15, 70.3] < (70.3, 105.45] < (105.45, 140.6] ... (210.9, 246.05] < (246.05, 281.2] < (281.2, 316.35] < (316.35, 351.5]]
In [7]:
totaldaycalls_array = data['totaldaycalls'].to_numpy()
pd.cut(totaldaycalls_array, 10)
Out[7]:
[(99.0, 115.5], (115.5, 132.0], (99.0, 115.5], (66.0, 82.5], (99.0, 115.5], ..., (115.5, 132.0], (82.5, 99.0], (82.5, 99.0], (66.0, 82.5], (99.0, 115.5]]
Length: 5000
Categories (10, interval[float64, right]): [(-0.165, 16.5] < (16.5, 33.0] < (33.0, 49.5] < (49.5, 66.0] ... (99.0, 115.5] < (115.5, 132.0] < (132.0, 148.5] < (148.5, 165.0]]
In [8]:
totaldaycharge_array = data['totaldaycharge'].to_numpy()
pd.cut(totaldaycharge_array, 10)
Out[8]:
[(41.832, 47.808], (23.904, 29.88], (35.856, 41.832], (47.808, 53.784], (23.904, 29.88], ..., (35.856, 41.832], (29.88, 35.856], (17.928, 23.904], (29.88, 35.856], (17.928, 23.904]]
Length: 5000
Categories (10, interval[float64, right]): [(-0.0598, 5.976] < (5.976, 11.952] < (11.952, 17.928] < (17.928, 23.904] ... (35.856, 41.832] < (41.832, 47.808] < (47.808, 53.784] < (53.784, 59.76]]
In [9]:
totaleveminutes_array = data['totaleveminutes'].to_numpy()
pd.cut(totaleveminutes_array, 10)
Out[9]:
[(181.85, 218.22], (181.85, 218.22], (109.11, 145.48], (36.37, 72.74], (145.48, 181.85], ..., (218.22, 254.59], (254.59, 290.96], (145.48, 181.85], (145.48, 181.85], (254.59, 290.96]]
Length: 5000
Categories (10, interval[float64, right]): [(-0.364, 36.37] < (36.37, 72.74] < (72.74, 109.11] < (109.11, 145.48] ... (218.22, 254.59] < (254.59, 290.96] < (290.96, 327.33] < (327.33, 363.7]]
In [10]:
totalevecalls_array = data['totalevecalls'].to_numpy()
pd.cut(totalevecalls_array, 10)
Out[10]:
[(85.0, 102.0], (102.0, 119.0], (102.0, 119.0], (85.0, 102.0], (119.0, 136.0], ..., (119.0, 136.0], (68.0, 85.0], (119.0, 136.0], (85.0, 102.0], (102.0, 119.0]]
Length: 5000
Categories (10, interval[float64, right]): [(-0.17, 17.0] < (17.0, 34.0] < (34.0, 51.0] < (51.0, 68.0] ... (102.0, 119.0] < (119.0, 136.0] < (136.0, 153.0] < (153.0, 170.0]]
In [11]:
totalevecharge_array = data['totalevecharge'].to_numpy()
pd.cut(totalevecharge_array, 10)
Out[11]:
[(15.455, 18.546], (15.455, 18.546], (9.273, 12.364], (3.091, 6.182], (12.364, 15.455], ..., (18.546, 21.637], (21.637, 24.728], (12.364, 15.455], (12.364, 15.455], (21.637, 24.728]]
Length: 5000
Categories (10, interval[float64, right]): [(-0.0309, 3.091] < (3.091, 6.182] < (6.182, 9.273] < (9.273, 12.364] ... (18.546, 21.637] < (21.637, 24.728] < (24.728, 27.819] < (27.819, 30.91]]
In [12]:
totalnightminutes_array = data['totalnightminutes'].to_numpy()
pd.cut(totalnightminutes_array, 10)
Out[12]:
[(237.0, 276.5], (237.0, 276.5], (158.0, 197.5], (158.0, 197.5], (158.0, 197.5], ..., (276.5, 316.0], (197.5, 237.0], (197.5, 237.0], (197.5, 237.0], (118.5, 158.0]]
Length: 5000
Categories (10, interval[float64, right]): [(-0.395, 39.5] < (39.5, 79.0] < (79.0, 118.5] < (118.5, 158.0] ... (237.0, 276.5] < (276.5, 316.0] < (316.0, 355.5] < (355.5, 395.0]]
In [13]:
totalnightcalls_array = data['totalnightcalls'].to_numpy()
pd.cut(totalnightcalls_array, 10)
Out[13]:
[(87.5, 105.0], (87.5, 105.0], (87.5, 105.0], (87.5, 105.0], (105.0, 122.5], ..., (105.0, 122.5], (105.0, 122.5], (87.5, 105.0], (87.5, 105.0], (87.5, 105.0]]
Length: 5000
Categories (10, interval[float64, right]): [(-0.175, 17.5] < (17.5, 35.0] < (35.0, 52.5] < (52.5, 70.0] ... (105.0, 122.5] < (122.5, 140.0] < (140.0, 157.5] < (157.5, 175.0]]
In [14]:
totalnightcharge_array = data['totalnightcharge'].to_numpy()
pd.cut(totalnightcharge_array, 10)
Out[14]:
[(10.662, 12.439], (10.662, 12.439], (7.108, 8.885], (7.108, 8.885], (7.108, 8.885], ..., (12.439, 14.216], (8.885, 10.662], (8.885, 10.662], (8.885, 10.662], (5.331, 7.108]]
Length: 5000
Categories (10, interval[float64, right]): [(-0.0178, 1.777] < (1.777, 3.554] < (3.554, 5.331] < (5.331, 7.108] ... (10.662, 12.439] < (12.439, 14.216] < (14.216, 15.993] < (15.993, 17.77]]
In [15]:
totalintlminutes_array = data['totalintlminutes'].to_numpy()
pd.cut(totalintlminutes_array, 10)
Out[15]:
[(8.0, 10.0], (12.0, 14.0], (12.0, 14.0], (6.0, 8.0], (10.0, 12.0], ..., (8.0, 10.0], (14.0, 16.0], (12.0, 14.0], (8.0, 10.0], (8.0, 10.0]]
Length: 5000
Categories (10, interval[float64, right]): [(-0.02, 2.0] < (2.0, 4.0] < (4.0, 6.0] < (6.0, 8.0] ... (12.0, 14.0] < (14.0, 16.0] < (16.0, 18.0] < (18.0, 20.0]]
In [16]:
totalintlcharge_array = data['totalintlcharge'].to_numpy()
pd.cut(totalintlcharge_array, 10)
Out[16]:
[(2.16, 2.7], (3.24, 3.78], (3.24, 3.78], (1.62, 2.16], (2.7, 3.24], ..., (2.16, 2.7], (3.78, 4.32], (3.24, 3.78], (2.16, 2.7], (2.16, 2.7]]
Length: 5000
Categories (10, interval[float64, right]): [(-0.0054, 0.54] < (0.54, 1.08] < (1.08, 1.62] < (1.62, 2.16] ... (3.24, 3.78] < (3.78, 4.32] < (4.32, 4.86] < (4.86, 5.4]]

One of the ways to select the number of bins into which the data will be discretized is through visualization. By plotting the data distribution in the previous step, we observe that the number of bins that might better fit our data is 10.

For continuous variables with a somewhat normal distribution, we decided to do equal-width binning. This divides the range of the data into a fixed number of bins of equal width. This method can be problematic if the data is not normally distributed.

In [17]:
pd.qcut(data['numbervmailmessages'], 10, duplicates='drop')
Out[17]:
0         (24.0, 32.0]
1         (24.0, 32.0]
2       (-0.001, 24.0]
3       (-0.001, 24.0]
4       (-0.001, 24.0]
             ...      
4995      (32.0, 52.0]
4996    (-0.001, 24.0]
4997    (-0.001, 24.0]
4998    (-0.001, 24.0]
4999      (32.0, 52.0]
Name: numbervmailmessages, Length: 5000, dtype: category
Categories (3, interval[float64, right]): [(-0.001, 24.0] < (24.0, 32.0] < (32.0, 52.0]]
In [18]:
pd.qcut(data['numbervmailmessages'], 10, duplicates='drop').value_counts()
Out[18]:
numbervmailmessages
(-0.001, 24.0]    4056
(24.0, 32.0]       499
(32.0, 52.0]       445
Name: count, dtype: int64
In [19]:
pd.qcut(data['totalintlcalls'], 10, duplicates='drop')
Out[19]:
0          (2.0, 3.0]
1          (2.0, 3.0]
2          (4.0, 5.0]
3          (6.0, 8.0]
4          (2.0, 3.0]
            ...      
4995       (4.0, 5.0]
4996    (-0.001, 2.0]
4997       (3.0, 4.0]
4998       (5.0, 6.0]
4999      (8.0, 20.0]
Name: totalintlcalls, Length: 5000, dtype: category
Categories (7, interval[float64, right]): [(-0.001, 2.0] < (2.0, 3.0] < (3.0, 4.0] < (4.0, 5.0] < (5.0, 6.0] < (6.0, 8.0] < (8.0, 20.0]]
In [20]:
pd.qcut(data['totalintlcalls'], 10, duplicates='drop').value_counts()
Out[20]:
totalintlcalls
(-0.001, 2.0]    1018
(2.0, 3.0]        993
(3.0, 4.0]        964
(4.0, 5.0]        715
(5.0, 6.0]        496
(6.0, 8.0]        480
(8.0, 20.0]       334
Name: count, dtype: int64
In [21]:
pd.qcut(data['numbercustomerservicecalls'], 10, duplicates='drop')
Out[21]:
0       (-0.001, 1.0]
1       (-0.001, 1.0]
2       (-0.001, 1.0]
3          (1.0, 2.0]
4          (2.0, 3.0]
            ...      
4995       (1.0, 2.0]
4996       (2.0, 3.0]
4997    (-0.001, 1.0]
4998    (-0.001, 1.0]
4999    (-0.001, 1.0]
Name: numbercustomerservicecalls, Length: 5000, dtype: category
Categories (4, interval[float64, right]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 3.0] < (3.0, 9.0]]
In [22]:
pd.qcut(data['numbercustomerservicecalls'], 10, duplicates='drop').value_counts()
Out[22]:
numbercustomerservicecalls
(-0.001, 1.0]    2788
(1.0, 2.0]       1152
(2.0, 3.0]        668
(3.0, 9.0]        392
Name: count, dtype: int64

For continuous variables with non-normal distributions, we chose equal-frequency binning. This divides the data into a fixed number of bins of equal frequency. This method is more robust to non-normal distributions but can be more difficult to implement. The goal is to create bins with approximately the same number of data points.

The resulting number of categories may be different from the specified number of bins, which happened to our data. This can occur due to not being uniformly distributed, there may be some bins with more or fewer data points than others. The equal-frequency binning algorithm may have to create fewer bins to accommodate the smaller number of data points in some categories. Furthermore, some algorithms may be more aggressive in rounding data points to the nearest bin boundary, which can result in fewer categories. Other algorithms may be more lenient, allowing for a wider range of data points to fall into each bin. Also, precision of the binning can also affect the number of categories.

4. Modelling¶

4.1 Data shuffle and split¶

First we decided to shuffle the data before splitting it into tranning and test data, in order to make it an unbiased division. We decided to split the data allocating 80% for training, instead of 70%, because having more data for training can lead to more accurate models. On the other hand, allocating 20%, instead of 30%, for testing may provide a lesser robust evaluation.

In [23]:
data_shuffled = data.sample(frac = 1, random_state=10)
In [24]:
data_shuffled
Out[24]:
churn accountlength internationalplan voicemailplan numbervmailmessages totaldayminutes totaldaycalls totaldaycharge totaleveminutes totalevecalls totalevecharge totalnightminutes totalnightcalls totalnightcharge totalintlminutes totalintlcalls totalintlcharge numbercustomerservicecalls
245 0.0 22.0 0.0 0.0 0.0 110.3 107.0 18.75 166.5 93.0 14.15 202.3 96.0 9.10 9.5 5.0 2.57 0.0
4493 0.0 83.0 0.0 1.0 37.0 133.1 117.0 22.63 171.4 89.0 14.57 218.1 92.0 9.81 10.9 4.0 2.94 3.0
4583 0.0 104.0 0.0 0.0 0.0 246.4 108.0 41.89 205.5 102.0 17.47 208.1 125.0 9.36 14.1 8.0 3.81 2.0
2242 0.0 192.0 0.0 0.0 0.0 185.0 88.0 31.45 224.9 98.0 19.12 212.4 105.0 9.56 11.4 3.0 3.08 2.0
3407 0.0 93.0 0.0 0.0 0.0 120.6 104.0 20.50 205.5 95.0 17.47 182.5 107.0 8.21 9.6 6.0 2.59 2.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1180 0.0 72.0 0.0 0.0 0.0 207.8 92.0 35.33 195.7 110.0 16.63 184.8 124.0 8.32 13.1 4.0 3.54 0.0
3441 0.0 56.0 0.0 0.0 0.0 254.8 83.0 43.32 194.0 98.0 16.49 186.2 118.0 8.38 13.5 2.0 3.65 2.2
1344 1.0 111.0 0.0 0.0 0.0 284.4 89.0 48.35 211.5 113.0 13.35 242.8 91.0 10.93 8.4 8.0 2.27 0.0
4623 0.0 74.0 0.0 0.0 0.0 207.1 79.0 35.21 182.0 100.0 15.47 233.7 73.0 10.52 7.4 8.0 2.00 1.0
1289 0.0 129.0 0.0 0.0 0.0 98.0 99.0 16.66 240.7 62.0 20.46 254.8 123.0 11.47 10.5 4.0 2.84 0.0

5000 rows × 18 columns

In [25]:
train_data, test_data = train_test_split(data_shuffled, test_size=0.2, random_state=10)
In [26]:
train_data
Out[26]:
churn accountlength internationalplan voicemailplan numbervmailmessages totaldayminutes totaldaycalls totaldaycharge totaleveminutes totalevecalls totalevecharge totalnightminutes totalnightcalls totalnightcharge totalintlminutes totalintlcalls totalintlcharge numbercustomerservicecalls
3439 1.0 110.0 0.0 0.0 0.0 220.9 100.0 37.550 279.6 76.0 23.77 239.8 107.0 10.79 13.5 9.0 3.650 1.0
4509 0.0 128.0 0.0 1.0 40.0 272.2 112.0 46.270 239.9 111.0 20.39 257.3 124.0 11.58 11.6 2.0 3.130 1.0
945 0.0 122.0 0.0 0.0 0.0 180.0 88.0 30.600 145.0 95.4 12.33 233.7 120.0 10.52 11.5 6.0 3.110 2.0
537 0.0 190.0 0.0 1.0 22.0 166.5 93.0 28.310 183.0 92.0 15.56 121.0 102.0 5.44 8.5 3.0 2.300 0.0
4887 0.0 116.0 0.0 0.0 0.0 152.4 101.0 25.910 191.0 90.0 16.24 143.0 96.0 6.44 11.1 9.0 3.000 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3619 0.0 81.0 0.0 1.0 36.0 177.6 106.0 28.020 210.2 138.0 17.87 173.6 78.0 7.81 4.3 2.0 1.160 2.0
456 0.0 60.0 0.0 0.0 0.0 98.2 88.0 16.690 180.5 69.0 15.34 223.6 69.0 10.06 9.3 2.0 2.510 2.0
3686 1.0 64.0 0.0 0.0 0.0 297.8 88.0 49.454 196.0 89.0 16.66 213.7 107.0 9.62 12.0 2.0 2.482 1.0
1677 0.0 123.0 0.0 0.0 0.0 163.1 119.0 27.730 249.4 51.0 21.20 168.2 77.0 7.57 9.0 10.0 2.430 1.0
4996 1.0 152.0 0.0 0.0 0.0 184.2 90.0 31.310 256.8 73.0 21.83 213.6 113.0 9.61 14.7 2.0 3.970 3.0

4000 rows × 18 columns

In [27]:
test_data
Out[27]:
churn accountlength internationalplan voicemailplan numbervmailmessages totaldayminutes totaldaycalls totaldaycharge totaleveminutes totalevecalls totalevecharge totalnightminutes totalnightcalls totalnightcharge totalintlminutes totalintlcalls totalintlcharge numbercustomerservicecalls
845 0.0 144.0 0.0 1.0 51.0 283.9 98.0 48.26 192.0 109.0 16.32 196.3 85.0 8.83 10.00 4.0 2.70 1.0
4675 0.0 52.0 0.0 0.0 0.0 165.9 122.0 28.20 212.2 107.0 18.04 202.1 94.0 9.09 8.30 3.0 2.24 1.0
1672 0.0 95.0 0.0 0.0 0.0 203.4 96.0 34.58 168.6 61.0 14.33 173.0 105.0 7.79 13.70 3.0 3.70 2.0
3424 0.0 82.0 0.0 0.0 0.0 183.6 111.0 31.21 303.5 88.0 25.80 189.8 115.0 8.54 10.30 6.0 2.78 0.0
4756 0.0 84.0 0.0 0.0 0.0 172.3 92.0 29.29 224.2 96.0 19.06 223.4 100.0 10.05 9.70 1.0 2.62 2.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2646 0.0 101.0 0.0 2.0 0.0 232.7 114.0 39.56 186.4 123.0 15.84 153.3 122.0 6.90 11.50 6.0 3.11 5.0
1637 0.0 74.0 0.0 0.0 0.0 225.2 93.0 38.28 215.1 120.0 18.28 241.8 95.0 10.88 9.74 2.0 2.46 2.0
4045 0.0 122.0 0.0 1.0 21.0 196.0 104.0 33.32 194.7 73.0 16.55 150.8 92.0 6.79 9.60 4.0 2.59 0.0
1093 0.0 210.0 0.0 0.0 0.0 104.6 121.0 17.78 149.5 71.0 12.71 255.1 67.0 11.48 6.50 8.0 1.76 2.0
570 0.0 51.0 0.0 0.0 0.0 181.5 108.0 30.86 196.9 87.0 16.74 187.2 119.0 8.42 10.30 2.0 2.78 1.0

1000 rows × 18 columns

4.2 - Sampling¶

Modeling with an imbalanced dataset, where one class is significantly more prevalent than the other, can have several implications: the model might become biased toward the majority class, might have poor generalization to the minority class and accuracy can become a misleading metric.

Due to the highly unbalanced nature of our data set, we performed two sampling techniques to improve the performance of the models. We used the Synthetic Minority Over-sampling Technique (SMOTE) for over-sampling the minority class (churn = 1) and the Random Under-Sampling technique to under-sample the majority class (churn = 0).

As a result, we obtained a balanced dataset where each class has exactly the same number of instances.

In [28]:
X_train = train_data.drop('churn', axis=1)
y_train = np.array(train_data['churn'])

X_test = test_data.drop('churn', axis=1)
y_test = np.array(test_data['churn'])
In [29]:
pipeline = make_pipeline(SMOTE(sampling_strategy=0.5, random_state=10), RandomUnderSampler(sampling_strategy=1.0, random_state=10))
X_train_balanced, y_train_balanced = pipeline.fit_resample(X_train,y_train)

X_train_balanced
Out[29]:
accountlength internationalplan voicemailplan numbervmailmessages totaldayminutes totaldaycalls totaldaycharge totaleveminutes totalevecalls totalevecharge totalnightminutes totalnightcalls totalnightcharge totalintlminutes totalintlcalls totalintlcharge numbercustomerservicecalls
3524 22.000000 0.000000 0.0 0.000000 222.900000 97.000000 37.890000 172.600000 91.000000 14.670000 187.600000 99.000000 8.440000 10.100000 3.000000 2.730000 1.000000
2478 172.000000 0.000000 0.0 0.000000 211.700000 100.000000 35.990000 198.700000 101.000000 16.890000 301.700000 136.000000 13.580000 6.500000 9.000000 1.760000 1.000000
717 96.000000 0.000000 0.0 0.000000 170.500000 86.000000 28.990000 277.500000 88.000000 23.590000 162.500000 117.000000 7.310000 12.200000 6.000000 3.290000 1.000000
3448 86.000000 0.000000 0.0 0.000000 178.300000 94.000000 30.310000 264.400000 74.000000 22.470000 284.600000 109.000000 12.810000 15.000000 2.000000 4.050000 0.000000
2234 127.000000 0.000000 0.0 0.000000 182.300000 124.000000 30.990000 169.900000 110.000000 14.440000 184.000000 116.000000 8.280000 9.300000 3.000000 2.510000 1.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5152 144.564041 0.503050 1.0 36.012198 240.861576 94.435959 40.943474 220.167675 125.957306 18.717255 191.531715 72.478653 8.616433 7.055794 3.490851 1.906555 1.496950
5153 104.668934 0.000000 0.0 0.000000 162.843956 81.387636 27.681929 190.288161 79.775271 16.176003 143.768254 78.262442 6.473756 9.408748 3.000000 2.542624 5.912519
5154 78.135668 0.946957 0.0 0.000000 220.712955 77.007821 37.523043 276.288609 121.310439 23.480983 239.669738 112.371302 10.787426 11.230262 4.893914 3.033852 1.053043
5155 82.899100 0.000000 0.0 0.000000 288.970929 100.911090 49.127982 233.689910 97.341659 19.860889 223.773327 147.696303 10.072148 12.779720 2.189810 3.451638 1.924076
5156 87.024894 0.000000 0.0 0.000000 93.093592 102.256886 15.826372 212.238268 116.154662 18.041009 225.963824 120.281779 10.171782 8.556356 3.359110 2.310498 3.051112

3438 rows × 17 columns

a) Modelling with the balanced dataset¶

a.1) Nearest neighbor¶

When training a kNN classifier, it is necessary to normalize the features because kNN measures the distance between points, and we do not want some variables to be weighted more than others. The Euclidean Distance is usually used, which is the square root of the sum of the squared differences between two points.

We normalized the data after splitting it into training and test sets so that normalization would not give the model additional information about the test set if we normalized all the data at once.

In [191]:
scaler = StandardScaler()
X_train_balanced_scaled = scaler.fit_transform(X_train_balanced)
X_test_scaled = scaler.transform(X_test)

To start training the model we used a fixed value of k=3, which will later be optimized.

In [192]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_balanced_scaled, y_train_balanced)
Out[192]:
KNeighborsClassifier(n_neighbors=3)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KNeighborsClassifier(n_neighbors=3)
In [193]:
y_pred = knn.predict(X_test_scaled)
In [194]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 0.81

We may be able to get a better accuracy value by optimizing our value of k. In order to do this, we loop through many different k values.

In [196]:
k_values = [i for i in range (1,30)]
scores = []

scaler = StandardScaler()
X_train_balanced_scaled = scaler.fit_transform(X_train_balanced_scaled)

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_balanced_scaled, y_train_balanced)
    y_pred = knn.predict(X_test_scaled)
    score = accuracy_score(y_test, y_pred)
    scores.append(np.mean(score))
In [102]:
sns.lineplot(x = k_values, y = scores, marker = 'o')
plt.xlabel("K Values")
plt.ylabel("Accuracy Score")
Out[102]:
Text(0, 0.5, 'Accuracy Score')
No description has been provided for this image

We can see from our chart that k = 9 and k = 13 appears to have the highest accuracy score. So we decided to choose the lowest one, k = 9.

In [197]:
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train_balanced_scaled, y_train_balanced)

y_pred = knn.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 0.801

These are the accuracy, precision, and recall scores for our model, using the best k score of 9:

In [198]:
knn_b = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train_balanced_scaled, y_train_balanced)

y_pred = knn.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
    
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
Accuracy: 0.801
Precision: 0.40217391304347827
Recall: 0.7655172413793103
F1 Score: 0.5273159144893111

Accuracy is the proportion of all the predictions that are correct, precision is the proportion of predicted positive labels that are actually positive, and recall is the proportion of actual positive labels that are correctly identified as positive.

As we can see, the accuracy value is higher using k=9 than it was when we initially used k=3. Although the accuracy of this model is considerably high, precision and recall are not, with recall being under 50%, which is reflected on a low F1 score.

In [199]:
print("Classification Report for X_train_balanced data:")
print(classification_report(y_test, y_pred, zero_division=0))

correct_predictions = (y_pred == y_test)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")
Classification Report for X_train_balanced data:
              precision    recall  f1-score   support

         0.0       0.95      0.81      0.87       855
         1.0       0.40      0.77      0.53       145

    accuracy                           0.80      1000
   macro avg       0.68      0.79      0.70      1000
weighted avg       0.87      0.80      0.82      1000

Number of correct predictions: 801
In [200]:
KNN_conf_matrix = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(4, 4))
sns.heatmap(KNN_conf_matrix, annot=True, fmt='g', cmap='Blues', 
            square=True,
            xticklabels=['No', 'Yes'], 
            yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of KNN')
plt.show()
No description has been provided for this image

a.2) Bayesian classifier¶

Bayesian classifiers are a class of probabilistic classifiers based on Bayes' theorem. In this work we used the MixedNB method, that handles both continuous and categorical variables.

First we scaled the numerical features using StandardScaler (normally distributed features) and MinMaxScaler (features that don’t follow a normal distribution). This preprocessing step is essential for scaling numerical features to have comparable ranges and magnitudes, enhancing the performance of this machine learning algorithm.

In [116]:
X_train_balanced = X_train_balanced.loc[:, X_train_balanced.columns != 'churn']
y_train_balanced = y_train_balanced.astype(int)

scaler_standard = StandardScaler()
numeric_features = ['accountlength', 'totaldayminutes', 'totaldaycalls', 'totaldaycharge',
                    'totaleveminutes', 'totalevecalls', 'totalevecharge', 'totalnightminutes', 'totalnightcalls',
                    'totalnightcharge', 'totalintlminutes', 'totalintlcharge']  

scaler_minmax = MinMaxScaler()
numeric_features_minmax = ['numbervmailmessages', 'totalintlcalls', 'numbercustomerservicecalls']  

X_numeric_train_balanced = X_train_balanced[numeric_features] 
X_numeric_train_balanced_scaled = scaler_standard.fit_transform(X_numeric_train_balanced)  
X_train_balanced_scaled = X_train_balanced.copy()
X_train_balanced_scaled[numeric_features] = X_numeric_train_balanced_scaled  

X_numeric_minmax = X_train_balanced_scaled[numeric_features_minmax]  
X_numeric_minmax_scaled = scaler_minmax.fit_transform(X_numeric_minmax)  
X_train_balanced_scaled[numeric_features_minmax] = X_numeric_minmax_scaled  

model_NB_b = MixedNB()

model_NB_b.fit(X_train_balanced_scaled, y_train_balanced)  

preds = model_NB_b.predict(X_train_balanced_scaled)

accuracy = accuracy_score(y_train_balanced, preds)
print(f"Accuracy on X_train_balanced data: {accuracy:.2f}")

print("Classification Report for X_train_balanced data:")
print(classification_report(y_train_balanced, preds, zero_division=0))

correct_predictions = (preds == y_train_balanced)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")

conf_matrix = confusion_matrix(y_train_balanced, preds)

plt.figure(figsize=(4, 4))
sns.heatmap(conf_matrix, annot=True, fmt='g', cmap='Blues', 
            square=True,
            xticklabels=['No', 'Yes'], 
            yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of NB')
plt.show()
Accuracy on X_train_balanced data: 0.78
Classification Report for X_train_balanced data:
              precision    recall  f1-score   support

           0       0.81      0.74      0.78      1719
           1       0.76      0.82      0.79      1719

    accuracy                           0.78      3438
   macro avg       0.79      0.78      0.78      3438
weighted avg       0.79      0.78      0.78      3438

Number of correct predictions: 2696
No description has been provided for this image

The accuracy achieved on the balanced training indicates a reasonable performance (77%), however presented a balanced precision for both classes (78% versus 76% for not not churned and churn, respectively). Furthermore the recall (75% and 79% for not churned and churn, respectively) and f1-scores (77% and 78% for not churned and churn, respectively) also indicated a balanced performance across the two classes.

a.3) Decision Tree¶

First we decided to develop a not-prunned decision tree, visualize the model and see its scores of accuracy, recal, precision and f1.

In [117]:
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train_balanced,y_train_balanced)
Out[117]:
DecisionTreeClassifier(random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(random_state=0)
In [118]:
plt.subplots(nrows = 1,ncols = 1,figsize = (12,12), dpi=200)
tree.plot_tree(model, fontsize=5, filled=True)
plt.title('Decision Tree Not Prunned')
plt.show()
No description has been provided for this image
In [119]:
preds = model.predict(X_train_balanced)
(vals,ans)=np.unique(preds==y_train_balanced, return_counts=True)
ans[vals]/sum(ans)
Out[119]:
array([1.])
In [40]:
def scores(model):
    preds = model.predict(X_test)
    print("Classification Report for test data:")
    print(classification_report(y_test, preds, zero_division=0))
In [121]:
scores(model)
Classification Report for test data:
              precision    recall  f1-score   support

         0.0       0.96      0.90      0.93       855
         1.0       0.58      0.79      0.67       145

    accuracy                           0.89      1000
   macro avg       0.77      0.84      0.80      1000
weighted avg       0.91      0.89      0.89      1000

Despite having 100% accuracy on the training data, the model has a lower accuracy value on the test data, which it is a clear indication of overfitting.

To address this issue, we consider pruning the decision tree by varying some of the hyperparameters:

In [122]:
max_depth=[]
accuracies = []
accuracies_train = []

for n in range(1,20,2):
    model = DecisionTreeClassifier(random_state=0,max_depth=n)
    model.fit(X_train_balanced,y_train_balanced)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    y_pred_train = model.predict(X_train_balanced)
    accuracy_train =  accuracy_score(y_train_balanced, y_pred_train)
    accuracies_train.append(accuracy_train)
    max_depth.append(n)
    accuracies.append(accuracy)
    print(f"With Max Depth of {n} the accuracy on the train data is {accuracy_train} and the accuracy on the test data is {accuracy}")
With Max Depth of 1 the accuracy on the train data is 0.673065735892961 and the accuracy on the test data is 0.832
With Max Depth of 3 the accuracy on the train data is 0.8528214077952297 and the accuracy on the test data is 0.825
With Max Depth of 5 the accuracy on the train data is 0.9162303664921466 and the accuracy on the test data is 0.904
With Max Depth of 7 the accuracy on the train data is 0.9432809773123909 and the accuracy on the test data is 0.919
With Max Depth of 9 the accuracy on the train data is 0.9616055846422339 and the accuracy on the test data is 0.914
With Max Depth of 11 the accuracy on the train data is 0.9720767888307156 and the accuracy on the test data is 0.907
With Max Depth of 13 the accuracy on the train data is 0.9872018615474113 and the accuracy on the test data is 0.901
With Max Depth of 15 the accuracy on the train data is 0.997673065735893 and the accuracy on the test data is 0.894
With Max Depth of 17 the accuracy on the train data is 1.0 and the accuracy on the test data is 0.886
With Max Depth of 19 the accuracy on the train data is 1.0 and the accuracy on the test data is 0.886
In [123]:
plt.figure(figsize=(10, 6))
plt.plot(max_depth, accuracies_train, label='Train Accuracy', marker='o')
plt.plot(max_depth, accuracies, label='Test Accuracy', marker='o')
plt.title('Decision Tree Accuracy vs. Max Depth')
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()
No description has been provided for this image

We've conducted a valuable experiment to analyze the effect of varying the max depth of the decision tree on both the training and test accuracy. Our findings show that:

  • As we increase the max depth of the decision tree, the model becomes more complex and is able to capture more intricate details of the training data.
  • The graph suggests that the accuracy on the test set peaks at a max depth of 7. Beyond a max depth of 7, the model starts to overfit the training data and loses its ability to generalize to new, unseen data, resulting in a decrease in accuracy on the test set.
In [124]:
model = DecisionTreeClassifier(random_state=0,max_depth=7)
model.fit(X_train_balanced,y_train_balanced)

plt.subplots(nrows = 1,ncols = 1,figsize = (14,14), dpi=200)
tree.plot_tree(model, fontsize=5, filled=True)
plt.title('Decision Tree with Max Depth Value of 7')
plt.show()
No description has been provided for this image

Manually iterating over a range of values for all the hyperparameters takes a lot of time and is not the most systematic and efficient way to tune a model. Instead we decide to do a cross validation approach using the GridSearchCV, which helps finding the best combination of hyperparameters in a more automated way. In addition, the performance metrics obtained through cross-validation are generally more reliable than a single train/test split.

In [30]:
model = DecisionTreeClassifier(random_state = 0)

param_grid = {
    'max_depth': [None,8,14,16],
    'min_samples_split': [2,5,10],
    'ccp_alpha':[0.0001,0.001,0.01]
}

grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train_balanced, y_train_balanced)
Out[30]:
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'ccp_alpha': [0.0001, 0.001, 0.01],
                         'max_depth': [None, 8, 14, 16],
                         'min_samples_split': [2, 5, 10]})
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'ccp_alpha': [0.0001, 0.001, 0.01],
                         'max_depth': [None, 8, 14, 16],
                         'min_samples_split': [2, 5, 10]})
DecisionTreeClassifier(random_state=0)
DecisionTreeClassifier(random_state=0)
In [31]:
best_dt_model = grid_search.best_estimator_

best_dt_model
Out[31]:
DecisionTreeClassifier(ccp_alpha=0.001, max_depth=16, random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(ccp_alpha=0.001, max_depth=16, random_state=0)
In [127]:
plt.subplots(nrows = 1,ncols = 1,figsize = (14,14), dpi=200)
tree.plot_tree(best_dt_model, fontsize=5, filled=True)
plt.title('Decision Tree with the best parameters defined from a Cross Validation Method')
plt.show()
No description has been provided for this image
In [128]:
scores(best_dt_model)
Classification Report for test data:
              precision    recall  f1-score   support

         0.0       0.97      0.93      0.95       855
         1.0       0.65      0.82      0.73       145

    accuracy                           0.91      1000
   macro avg       0.81      0.87      0.84      1000
weighted avg       0.92      0.91      0.91      1000

In [135]:
preds = best_dt_model.predict(X_test)
cm = confusion_matrix(y_test, preds)

plt.figure(figsize=(4, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix of Decision Tree Model')
plt.show()
No description has been provided for this image

It's important to notice that the GridSearchCV indicate the value of max_depth = None, as the best one, instead of our previous conclusion. This may happen because hyperparameters can interact with each other. The optimal value for one hyperparameter might depend on the values of others and it's possible that the combination of hyperparameters led to a higher max_depth being selected.

Now, with a different approach, we decided to tune the hyperparameters in order to achieve the best possible f1 score. However, both models showed similiar values of accuracy and f1. The reasons can be that the specific hyperparameter values that work well for one metric might also work well for another and we end up with similar models.

In [32]:
model_f1 = DecisionTreeClassifier(random_state = 0)

grid_search_f1 = GridSearchCV(model_f1, param_grid, scoring='f1', cv=5)
grid_search_f1.fit(X_train_balanced, y_train_balanced)
Out[32]:
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'ccp_alpha': [0.0001, 0.001, 0.01],
                         'max_depth': [None, 8, 14, 16],
                         'min_samples_split': [2, 5, 10]},
             scoring='f1')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'ccp_alpha': [0.0001, 0.001, 0.01],
                         'max_depth': [None, 8, 14, 16],
                         'min_samples_split': [2, 5, 10]},
             scoring='f1')
DecisionTreeClassifier(random_state=0)
DecisionTreeClassifier(random_state=0)
In [33]:
best_dt_model_f1_b = grid_search_f1.best_estimator_

best_dt_model_f1_b
Out[33]:
DecisionTreeClassifier(ccp_alpha=0.001, max_depth=16, random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(ccp_alpha=0.001, max_depth=16, random_state=0)
In [138]:
scores(best_dt_model_f1_b)
Classification Report for test data:
              precision    recall  f1-score   support

         0.0       0.97      0.93      0.95       855
         1.0       0.65      0.82      0.73       145

    accuracy                           0.91      1000
   macro avg       0.81      0.87      0.84      1000
weighted avg       0.92      0.91      0.91      1000

In [141]:
preds = best_dt_model_f1_b.predict(X_test)
cm = confusion_matrix(y_test, preds)

plt.figure(figsize=(4, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix of the F1 Tunned Decision Tree Model')
plt.show()
No description has been provided for this image

a.4) Tree Ensembles Methods¶

We decided to apply Bagging, Random Forest and Boosting models. Our approach was defined by the following steps:

  1. Tune the number of estimators, number of trees, to be used (generally, the higher the number of trees, more stable are our predictions but the computational cost is involved)
  2. Tune any other hyperparameters
  3. Apply and fit the final model
  4. Evaluate the models and compare them through a ROC curve

In the first step, we fine-tuned the number of estimators with a smaller set of data to speed up the hyperparameter tuning process while still obtaining a reasonable estimate of the model's perfomance.

In [34]:
X_train_tune, X_train_final, y_train_tune, y_train_final = train_test_split(X_train_balanced, y_train_balanced, test_size=0.8, random_state=0)
In [35]:
bag = BaggingClassifier(random_state=0)

param_grid = {
    'n_estimators': [100, 150, 200, 250]
}

grid_search = GridSearchCV(bag, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train_tune, y_train_tune)
best_bag_b = grid_search.best_estimator_
In [36]:
best_bag_b
Out[36]:
BaggingClassifier(n_estimators=200, random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
BaggingClassifier(n_estimators=200, random_state=0)
In [39]:
base_tree = DecisionTreeClassifier()

best_bag_b.fit(X_train_balanced, y_train_balanced)
Out[39]:
BaggingClassifier(n_estimators=200, random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
BaggingClassifier(n_estimators=200, random_state=0)

For the Random Forest model, in addition to the number of trees, we have to define the maximum number of features each individual tree in the forest is allowed to use when splitting a node. We can set it to None, sqrt or log2. If the dataset has very large number of features, this hyperparameter helps in controlling overfitting.

In [41]:
rf = RandomForestClassifier(random_state=0)
param_grid = {
    'n_estimators': [100, 150, 200, 250],
    'max_features': [None, 'sqrt', 'log2']  
}

grid_search = GridSearchCV(rf, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train_tune, y_train_tune)
best_rf_b = grid_search.best_estimator_
In [42]:
best_rf_b
Out[42]:
RandomForestClassifier(n_estimators=200, random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(n_estimators=200, random_state=0)
In [43]:
best_rf_b.fit(X_train_balanced, y_train_balanced)
scores(best_rf_b)
Classification Report for test data:
              precision    recall  f1-score   support

         0.0       0.97      0.96      0.96       855
         1.0       0.76      0.83      0.79       145

    accuracy                           0.94      1000
   macro avg       0.87      0.89      0.88      1000
weighted avg       0.94      0.94      0.94      1000

In [44]:
model =  AdaBoostClassifier(random_state=0)
param_grid = {
    'n_estimators': [100, 150, 200, 250]
}

grid_search = GridSearchCV(model, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train_tune, y_train_tune)
best_ada_b = grid_search.best_estimator_
best_ada_b
Out[44]:
AdaBoostClassifier(n_estimators=100, random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AdaBoostClassifier(n_estimators=100, random_state=0)
In [45]:
best_ada_b.fit(X_train_balanced, y_train_balanced)
scores(best_ada_b)
Classification Report for test data:
              precision    recall  f1-score   support

         0.0       0.94      0.89      0.91       855
         1.0       0.49      0.65      0.56       145

    accuracy                           0.85      1000
   macro avg       0.71      0.77      0.73      1000
weighted avg       0.87      0.85      0.86      1000

In [46]:
model =  GBC(random_state=0)
param_grid = {
    'n_estimators': [100, 150, 200, 250]
}

grid_search = GridSearchCV(model, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train_tune, y_train_tune)
best_gbc_b = grid_search.best_estimator_
In [47]:
best_gbc_b.fit(X_train_balanced, y_train_balanced)
scores(best_gbc_b)
Classification Report for test data:
              precision    recall  f1-score   support

         0.0       0.97      0.96      0.96       855
         1.0       0.77      0.82      0.80       145

    accuracy                           0.94      1000
   macro avg       0.87      0.89      0.88      1000
weighted avg       0.94      0.94      0.94      1000

In [153]:
models = [best_bag_b,best_rf_b, best_ada_b, best_gbc_b]
model_names = ['Bagging','Random Forest','Ada Booster','Gradient Boosting']

plt.figure(figsize=(8, 8))

for model, name in zip(models, model_names):
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=2, label=f'{name} (AUC = {roc_auc:.2f})')

plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Random Classifier')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
No description has been provided for this image

a.5) Support vector machines¶

Support Vector Machines (SVMs) are supervised learning models used for classification and regression tasks. SVMs can handle linear and non-linearly data by using kernel functions (e.g., linear, polynomial, radial basis function - RBF).

In this work we used the SVC class from scikit-learn to create an SVM model and tested with the kernel linear and RBF functions and used a confusion matrix to analyze the model's ability to correctly predict churn and non-churn instances. As previously performed for the Bayesian classifier we also scaled the numerical features using StandardScaler (normally distributed features) and MinMaxScaler (features that don’t follow a normal distribution).

In [154]:
X_train_balanced = X_train_balanced.loc[:, X_train_balanced.columns != 'churn']
y_train_balanced = y_train_balanced.astype(int)

scaler_standard = StandardScaler()
numeric_features = ['accountlength', 'totaldayminutes', 'totaldaycalls', 'totaldaycharge',
                    'totaleveminutes', 'totalevecalls', 'totalevecharge', 'totalnightminutes', 'totalnightcalls',
                    'totalnightcharge', 'totalintlminutes', 'totalintlcharge']  

scaler_minmax = MinMaxScaler()
numeric_features_minmax = ['numbervmailmessages', 'totalintlcalls', 'numbercustomerservicecalls']  

X_numeric_train_balanced = X_train_balanced[numeric_features] 
X_numeric_train_balanced_scaled = scaler_standard.fit_transform(X_numeric_train_balanced)  
X_train_balanced_scaled = X_train_balanced.copy()
X_train_balanced_scaled[numeric_features] = X_numeric_train_balanced_scaled  

X_numeric_minmax = X_train_balanced_scaled[numeric_features_minmax]  
X_numeric_minmax_scaled = scaler_minmax.fit_transform(X_numeric_minmax)  
X_train_balanced_scaled[numeric_features_minmax] = X_numeric_minmax_scaled  
In [156]:
# Linear Kernel

svm_model_lb = SVC(kernel='linear', C=100)
svm_model_lb.fit(X_train_balanced_scaled, y_train_balanced)

train_preds = svm_model_lb.predict(X_train_balanced_scaled)
X_train_balanced_accuracy = accuracy_score(y_train_balanced, train_preds)
print(f"Accuracy on X_train_balanced data: {X_train_balanced_accuracy:.2f}")

print("Classification Report for X_train_balanced data:")
print(classification_report(y_train_balanced, train_preds, zero_division=0))

correct_predictions = (train_preds == y_train_balanced)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")

train_conf_matrix = confusion_matrix(y_train_balanced, train_preds)

plt.figure(figsize=(4, 4))
sns.heatmap(train_conf_matrix, annot=True, fmt='g', cmap='Blues', 
             square=True,
            xticklabels=['No', 'Yes'], 
            yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of SMV Linear Kernel')
plt.show()
Accuracy on X_train_balanced data: 0.80
Classification Report for X_train_balanced data:
              precision    recall  f1-score   support

           0       0.82      0.76      0.79      1719
           1       0.78      0.84      0.80      1719

    accuracy                           0.80      3438
   macro avg       0.80      0.80      0.80      3438
weighted avg       0.80      0.80      0.80      3438

Number of correct predictions: 2741
No description has been provided for this image

The model attained a reasonable accuracy of 79% on the balanced training data. It displays an 80% precision in identifying class not churned, but with a 78% recall, indicating it might miss some actual instances of this class. For class churned, precision stands at 77%, with a slightly better recall of 81%. The f1-scores, which balance precision and recall, hover around 78% and 79% for classes not churned and churn, respectively. This suggests a balanced model performance, with nearly equal predictive accuracies across both classes.

In [158]:
# RBF Kernel
svm_model_rbf_b = SVC(kernel='rbf', C=100)
svm_model_rbf_b.fit(X_train_balanced_scaled, y_train_balanced)

train_preds = svm_model_rbf_b.predict(X_train_balanced_scaled)
X_train_balanced_accuracy = accuracy_score(y_train_balanced, train_preds)
print(f"Accuracy on X_train_balanced data: {X_train_balanced_accuracy:.2f}")

print("Classification Report for X_train_balanced data:")
print(classification_report(y_train_balanced, train_preds, zero_division=0))

correct_predictions = (train_preds == y_train_balanced)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")

train_conf_matrix = confusion_matrix(y_train_balanced, train_preds)

plt.figure(figsize=(4, 4))
sns.heatmap(train_conf_matrix, annot=True, fmt='g', cmap='Blues', 
             square=True,
            xticklabels=['No', 'Yes'], 
            yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of SMV RBF Kernel')
plt.show()
Accuracy on X_train_balanced data: 1.00
Classification Report for X_train_balanced data:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1719
           1       1.00      1.00      1.00      1719

    accuracy                           1.00      3438
   macro avg       1.00      1.00      1.00      3438
weighted avg       1.00      1.00      1.00      3438

Number of correct predictions: 3426
No description has been provided for this image

The model displays an exceptional accuracy of 100% on the balanced training data, reflecting a very high correctness in predictions across both classes. This result is corroborated by the precision, recall, and f1-scores of 100% for both classes, indicating an balanced performance for the given dataset.

In [160]:
cv = KFold(n_splits=5, shuffle=True, random_state=42)

svm_model_rbf_b = SVC(kernel='rbf', C=100)
scores_rbf_b = cross_val_score(svm_model_rbf_b, X_train_balanced_scaled, y_train_balanced, cv=cv, scoring='accuracy')

print("Cross-validation scores for RBF Kernel:", scores_rbf_b)
print("Mean accuracy for RBF Kernel:", np.mean(scores_rbf_b))

svm_model_lb = SVC(kernel='linear', C=100)
scores_lb = cross_val_score(svm_model_lb, X_train_balanced_scaled, y_train_balanced, cv=cv, scoring='accuracy')

print("Cross-validation scores for Linear Kernel:", scores_lb)
print("Mean accuracy for Linear Kernel:", np.mean(scores_lb))
Cross-validation scores for RBF Kernel: [0.90988372 0.8997093  0.90261628 0.89665211 0.88500728]
Mean accuracy for RBF Kernel: 0.898773738194374
Cross-validation scores for Linear Kernel: [0.80377907 0.79215116 0.77180233 0.80494905 0.79039301]
Mean accuracy for Linear Kernel: 0.7926149250194644

These results suggest that the RBF kernel performs significantly better on average compared to the Linear kernel, as indicated by the higher mean accuracy observed of the RBF kernel.

a.6) Neural Network Classifier¶

To create a Neural Network Classifier, we started by creating a model with no defined parameters. We fit this model to our training data and used it to make predictions of our target variable (y_pred) using our test variables (X_test).

In [ ]:
clf = MLPClassifier().fit(X_train_balanced, y_train_balanced)

y_pred = clf.predict(X_test)
In [162]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
    
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
Accuracy: 0.679
Precision: 0.28743961352657005
Recall: 0.8206896551724138
F1 Score: 0.42576028622540246

This model has an accuracy of 0.856, and the precision and recall of this model are low, which in turn gives us a low f1 score.

We created a hyperparameter grid to search for the best hidden layer sizes. We tested a few options of hidden layer sizes to demostrate how the search grid works, however, more options can be tested. The more options we test, the more likelyhood of getting a higher performing model, but it will take a longer time to run. The performance of the models will be evaluated using accuracy.

In [165]:
mlp_gs = MLPClassifier(max_iter=100)
parameter_space = {
    'hidden_layer_sizes': [(50,), (100,), (200,), (10, 20), (10, 50, 10)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}

clf = GridSearchCV(mlp_gs, parameter_space, n_jobs=-1, cv=5)
clf.fit(X_train_balanced, y_train_balanced) 
C:\Users\Manuela\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:691: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (100) reached and the optimization hasn't converged yet.
  warnings.warn(
Out[165]:
GridSearchCV(cv=5, estimator=MLPClassifier(max_iter=100), n_jobs=-1,
             param_grid={'activation': ['tanh', 'relu'],
                         'alpha': [0.0001, 0.05],
                         'hidden_layer_sizes': [(50,), (100,), (200,), (10, 20),
                                                (10, 50, 10)],
                         'learning_rate': ['constant', 'adaptive'],
                         'solver': ['sgd', 'adam']})
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=5, estimator=MLPClassifier(max_iter=100), n_jobs=-1,
             param_grid={'activation': ['tanh', 'relu'],
                         'alpha': [0.0001, 0.05],
                         'hidden_layer_sizes': [(50,), (100,), (200,), (10, 20),
                                                (10, 50, 10)],
                         'learning_rate': ['constant', 'adaptive'],
                         'solver': ['sgd', 'adam']})
MLPClassifier(max_iter=100)
MLPClassifier(max_iter=100)
In [167]:
print('Best parameters found:\n', clf.best_params_)
Best parameters found:
 {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'adam'}
In [168]:
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
0.639 (+/-0.031) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.771 (+/-0.022) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'adam'}
0.652 (+/-0.040) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.763 (+/-0.024) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.668 (+/-0.029) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.784 (+/-0.054) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}
0.686 (+/-0.062) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.785 (+/-0.045) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.689 (+/-0.040) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.807 (+/-0.056) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'adam'}
0.708 (+/-0.033) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.812 (+/-0.058) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.603 (+/-0.069) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'sgd'}
0.619 (+/-0.030) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'adam'}
0.609 (+/-0.116) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.617 (+/-0.125) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.604 (+/-0.052) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'sgd'}
0.621 (+/-0.043) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'adam'}
0.609 (+/-0.028) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.646 (+/-0.083) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.646 (+/-0.053) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.764 (+/-0.051) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'adam'}
0.641 (+/-0.043) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.749 (+/-0.052) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.667 (+/-0.032) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.794 (+/-0.038) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}
0.667 (+/-0.043) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.807 (+/-0.040) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.698 (+/-0.037) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.816 (+/-0.049) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'adam'}
0.697 (+/-0.048) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.817 (+/-0.050) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.538 (+/-0.066) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'sgd'}
0.678 (+/-0.123) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'adam'}
0.593 (+/-0.096) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.660 (+/-0.102) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.619 (+/-0.067) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'sgd'}
0.651 (+/-0.035) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'adam'}
0.614 (+/-0.088) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.600 (+/-0.082) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.669 (+/-0.028) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.784 (+/-0.057) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'adam'}
0.654 (+/-0.040) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.771 (+/-0.071) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.672 (+/-0.012) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.775 (+/-0.074) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}
0.676 (+/-0.032) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.772 (+/-0.105) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.691 (+/-0.042) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.805 (+/-0.053) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'adam'}
0.692 (+/-0.042) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.788 (+/-0.068) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.628 (+/-0.127) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'sgd'}
0.645 (+/-0.079) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'adam'}
0.572 (+/-0.138) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.715 (+/-0.076) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.627 (+/-0.131) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'sgd'}
0.691 (+/-0.063) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'adam'}
0.621 (+/-0.119) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.629 (+/-0.148) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.650 (+/-0.060) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.769 (+/-0.034) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'adam'}
0.665 (+/-0.030) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.766 (+/-0.028) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.671 (+/-0.052) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.780 (+/-0.064) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}
0.675 (+/-0.053) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.776 (+/-0.056) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.691 (+/-0.041) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.808 (+/-0.057) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'adam'}
0.696 (+/-0.051) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.783 (+/-0.048) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.631 (+/-0.082) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'sgd'}
0.700 (+/-0.097) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'adam'}
0.543 (+/-0.122) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.685 (+/-0.099) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.600 (+/-0.154) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'sgd'}
0.721 (+/-0.060) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'adam'}
0.656 (+/-0.041) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.702 (+/-0.125) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'adam'}
In [170]:
clf_b = MLPClassifier(activation= 'tanh', alpha = 0.05, hidden_layer_sizes = (200,), learning_rate = 'adaptive', solver = 'adam').fit(X_train_balanced, y_train_balanced)


y_pred = clf_b.predict(X_test)
In [172]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
    
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
Accuracy: 0.795
Precision: 0.38549618320610685
Recall: 0.696551724137931
F1 Score: 0.4963144963144963

As we can see, creating a hyperparameter grid allowed us to select the best hidden layer sizes amongst the ones we tested. The new model has an accuracy of 0.814, which is similar than our first models’ accuracy, 0.856, which had no defined parameters. The recall and precision, as well as the f1 score, have similar values.

In [173]:
print("Classification Report for X_train_balanced data:")
print(classification_report(y_test, y_pred, zero_division=0))

correct_predictions = (y_pred == y_test)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")
Classification Report for X_train_balanced data:
              precision    recall  f1-score   support

         0.0       0.94      0.81      0.87       855
         1.0       0.39      0.70      0.50       145

    accuracy                           0.80      1000
   macro avg       0.66      0.75      0.68      1000
weighted avg       0.86      0.80      0.82      1000

Number of correct predictions: 795
In [174]:
NNC_conf_matrix = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(4, 4))
sns.heatmap(KNN_conf_matrix, annot=True, fmt='g', cmap='Blues', 
            square=True,
            xticklabels=['No', 'Yes'], 
            yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of NNC')
plt.show()
No description has been provided for this image

b) Modelling with the unbalanced dataset¶

b.1) Nearest neighbor¶

In [215]:
scaler = StandardScaler()
X_train_scaler = scaler.fit_transform(X_train)
X_test_scaler = scaler.transform(X_test)

To start training the model we used a fixed value of k=2, which will be later optimized.

In [218]:
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train_scaler, y_train)
Out[218]:
KNeighborsClassifier(n_neighbors=2)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KNeighborsClassifier(n_neighbors=2)
In [219]:
y_pred = knn.predict(X_test_scaler)
In [220]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 0.88

We may be able to get a better accuracy value by optimizing our value of k. In order to do this, we loop through many different k values.

In [221]:
k_values = [i for i in range (1,30)]
scores = []

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_scaler)

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaler, y_train)
    y_pred = knn.predict(X_test_scaler)
    score = accuracy_score(y_test, y_pred)
    scores.append(np.mean(score))
In [222]:
sns.lineplot(x = k_values, y = scores, marker = 'o')
plt.xlabel("K Values")
plt.ylabel("Accuracy Score")
Out[222]:
Text(0, 0.5, 'Accuracy Score')
No description has been provided for this image
In [224]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaler, y_train)

y_pred = knn.predict(X_test_scaler)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 0.898

We can see from our chart that k = 3 and k = 7 appear to have the highest accuracy scores. After calculating their individual accuracies, K = 3 has a slightly highest value, so this is the value we will use.

These are the accuracy, precision, and recall scores for our model, using the best k score of 3:

In [226]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaler, y_train)

y_pred = knn.predict(X_test_scaler)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
    
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
Accuracy: 0.898
Precision: 0.7721518987341772
Recall: 0.4206896551724138
F1 Score: 0.5446428571428571

As we can see, the accuracy value is higher using k=3 than it was when we initially used k=5. Although the accuracy of this model is considerably high, precision and recall are not, with recall being under 50%, which is reflected on a low f1 score.

In [227]:
KNN_conf_matrix = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(4, 4))
sns.heatmap(KNN_conf_matrix, annot=True, fmt='g', cmap='Blues', 
            square=True,
            xticklabels=['No', 'Yes'], 
            yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of KNN')
plt.show()
No description has been provided for this image

b.2) Bayesian classifier¶

In [228]:
X_train = train_data.loc[:, train_data.columns != 'churn']
y_train = train_data['churn'].astype(int)

X_test = test_data.loc[:, test_data.columns != 'churn']
y_test = test_data['churn'].astype(int)

scaler_standard = StandardScaler()
numeric_features = ['accountlength', 'totaldayminutes', 'totaldaycalls', 'totaldaycharge',
                    'totaleveminutes', 'totalevecalls', 'totalevecharge', 'totalnightminutes', 'totalnightcalls',
                    'totalnightcharge', 'totalintlminutes', 'totalintlcharge']  

scaler_minmax = MinMaxScaler()
numeric_features_minmax = ['numbervmailmessages', 'totalintlcalls', 'numbercustomerservicecalls']  

X_numeric_train = X_train[numeric_features] 
X_numeric_train_scaled = scaler_standard.fit_transform(X_numeric_train)  
X_train_scaled = X_train.copy()
X_train_scaled[numeric_features] = X_numeric_train_scaled  

X_numeric_minmax = X_train_scaled[numeric_features_minmax]  
X_numeric_minmax_scaled = scaler_minmax.fit_transform(X_numeric_minmax)  
X_train_scaled[numeric_features_minmax] = X_numeric_minmax_scaled  

X_numeric_test = X_test[numeric_features] 
X_numeric_test_scaled = scaler_standard.transform(X_numeric_test)  
X_test_scaled = X_test.copy()
X_test_scaled[numeric_features] = X_numeric_test_scaled  

X_numeric_test_minmax = X_test_scaled[numeric_features_minmax]  
X_numeric_test_minmax_scaled = scaler_minmax.transform(X_numeric_test_minmax)  
X_test_scaled[numeric_features_minmax] = X_numeric_test_minmax_scaled  

model_NB = MixedNB()

model_NB.fit(X_train_scaled, y_train)  

preds = model_NB.predict(X_train_scaled)

accuracy = accuracy_score(y_train, preds)
print(f"Accuracy on train data: {accuracy:.2f}")

print("Classification Report for train data:")
print(classification_report(y_train, preds, zero_division=0))

correct_predictions = (preds == y_train)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")

conf_matrix = confusion_matrix(y_train, preds)

plt.figure(figsize=(4, 4))
sns.heatmap(conf_matrix, annot=True, fmt='g', cmap='Blues', 
            square=True,
            xticklabels=['No', 'Yes'], 
            yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()

preds = model_NB.predict(X_test_scaled)

accuracy = accuracy_score(y_test, preds)
print(f"Accuracy on test data: {accuracy:.2f}")

print("Classification Report for test data:")
print(classification_report(y_test, preds, zero_division=0))

correct_predictions = (preds == y_test)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")

conf_matrix = confusion_matrix(y_test, preds)

plt.figure(figsize=(4, 4))
sns.heatmap(conf_matrix, annot=True, fmt='g', cmap='Blues', 
            square=True,
            xticklabels=['No', 'Yes'], 
            yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of NB')
plt.show()
Accuracy on train data: 0.88
Classification Report for train data:
              precision    recall  f1-score   support

           0       0.92      0.94      0.93      3438
           1       0.58      0.53      0.55       562

    accuracy                           0.88      4000
   macro avg       0.75      0.73      0.74      4000
weighted avg       0.88      0.88      0.88      4000

Number of correct predictions: 3520
No description has been provided for this image
Accuracy on test data: 0.87
Classification Report for test data:
              precision    recall  f1-score   support

           0       0.92      0.93      0.92       855
           1       0.56      0.52      0.54       145

    accuracy                           0.87      1000
   macro avg       0.74      0.73      0.73      1000
weighted avg       0.87      0.87      0.87      1000

Number of correct predictions: 871
No description has been provided for this image

The results obtained seem to reflect that the dataset is class imbalanced, where the number of samples for the churned class is considerably smaller than the not churned class, probably affecting the classifier's ability to predict the minority class accurately.

As a result, the model seems to perform better in predicting the not churned class than the churned class for both precision and recall, indicating that the model has more difficulty in correctly identifying churn cases.

Nonetheless, the model seems to generalize reasonably well from training to test data, with a relatively consistent drop in performance but still maintaining similar values of accuracy, precision, recall, and f1-scores on the test set as compared to the training set.

b.3) Decision Tree¶

In [48]:
model = DecisionTreeClassifier(random_state = 0)

param_grid = {
    'max_depth': [None,4,8,14,16,20,22],
    'min_samples_split': [15,20,25,30,],
    'ccp_alpha':[0.0001,0.001,0.01]
}

grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)
Out[48]:
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'ccp_alpha': [0.0001, 0.001, 0.01],
                         'max_depth': [None, 4, 8, 14, 16, 20, 22],
                         'min_samples_split': [15, 20, 25, 30]})
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'ccp_alpha': [0.0001, 0.001, 0.01],
                         'max_depth': [None, 4, 8, 14, 16, 20, 22],
                         'min_samples_split': [15, 20, 25, 30]})
DecisionTreeClassifier(random_state=0)
DecisionTreeClassifier(random_state=0)
In [49]:
best_dt_model = grid_search.best_estimator_
best_dt_model
Out[49]:
DecisionTreeClassifier(ccp_alpha=0.001, max_depth=16, min_samples_split=25,
                       random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(ccp_alpha=0.001, max_depth=16, min_samples_split=25,
                       random_state=0)
In [238]:
plt.subplots(nrows = 1,ncols = 1,figsize = (14,14), dpi=200)
tree.plot_tree(best_dt_model, fontsize=5, filled=True)
plt.title('Decision Tree with the best parameters defined from a Cross Validation Method')
plt.show()
No description has been provided for this image
In [50]:
def scores(model):
    preds = model.predict(X_test)
    print("Classification Report for test data:")
    print(classification_report(y_test, preds, zero_division=0))
scores(best_dt_model)
Classification Report for test data:
              precision    recall  f1-score   support

         0.0       0.95      0.99      0.97       855
         1.0       0.90      0.67      0.77       145

    accuracy                           0.94      1000
   macro avg       0.92      0.83      0.87      1000
weighted avg       0.94      0.94      0.94      1000

In [245]:
preds = best_dt_model.predict(X_test)
cm = confusion_matrix(y_test, preds)

plt.figure(figsize=(4, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix of Decision Tree Model')
plt.show()
No description has been provided for this image
In [51]:
model_f1 = DecisionTreeClassifier(random_state = 0)

grid_search_f1 = GridSearchCV(model_f1, param_grid, scoring='f1', cv=5)
grid_search_f1.fit(X_train, y_train)
Out[51]:
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'ccp_alpha': [0.0001, 0.001, 0.01],
                         'max_depth': [None, 4, 8, 14, 16, 20, 22],
                         'min_samples_split': [15, 20, 25, 30]},
             scoring='f1')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'ccp_alpha': [0.0001, 0.001, 0.01],
                         'max_depth': [None, 4, 8, 14, 16, 20, 22],
                         'min_samples_split': [15, 20, 25, 30]},
             scoring='f1')
DecisionTreeClassifier(random_state=0)
DecisionTreeClassifier(random_state=0)
In [52]:
best_dt_model_f1 = grid_search_f1.best_estimator_

best_dt_model_f1
Out[52]:
DecisionTreeClassifier(ccp_alpha=0.001, max_depth=16, min_samples_split=25,
                       random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(ccp_alpha=0.001, max_depth=16, min_samples_split=25,
                       random_state=0)
In [53]:
scores(best_dt_model_f1)
Classification Report for test data:
              precision    recall  f1-score   support

         0.0       0.95      0.99      0.97       855
         1.0       0.90      0.67      0.77       145

    accuracy                           0.94      1000
   macro avg       0.92      0.83      0.87      1000
weighted avg       0.94      0.94      0.94      1000

In [54]:
preds = best_dt_model_f1.predict(X_test)
cm = confusion_matrix(y_test, preds)

plt.figure(figsize=(4, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix of the F1 Tunned Decision Tree Model')
plt.show()
No description has been provided for this image

b.4) Tree Ensembles Methods¶

In [55]:
X_train_tune, X_train_final, y_train_tune, y_train_final = train_test_split(X_train, y_train, test_size=0.8, random_state=0)
In [56]:
bag = BaggingClassifier(random_state=0)

param_grid = {
    'n_estimators': [100, 150, 200, 250]
}

grid_search = GridSearchCV(bag, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train_tune, y_train_tune)
best_bag = grid_search.best_estimator_
In [57]:
best_bag
Out[57]:
BaggingClassifier(n_estimators=200, random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
BaggingClassifier(n_estimators=200, random_state=0)
In [58]:
base_tree = DecisionTreeClassifier()
bag = BaggingClassifier(n_estimators=200, random_state=0)

bag.fit(X_train, y_train)
scores(bag)
Classification Report for test data:
              precision    recall  f1-score   support

         0.0       0.96      0.99      0.97       855
         1.0       0.90      0.73      0.81       145

    accuracy                           0.95      1000
   macro avg       0.93      0.86      0.89      1000
weighted avg       0.95      0.95      0.95      1000

In [59]:
rf = RandomForestClassifier(random_state=0)
param_grid = {
    'n_estimators': [100, 150, 200, 250],
    'max_features': [None, 'sqrt', 'log2']  
}

grid_search = GridSearchCV(rf, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train_tune, y_train_tune)
best_rf = grid_search.best_estimator_
In [60]:
best_rf
Out[60]:
RandomForestClassifier(max_features=None, n_estimators=250, random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_features=None, n_estimators=250, random_state=0)
In [61]:
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train,y_train)
scores(rf)
Classification Report for test data:
              precision    recall  f1-score   support

         0.0       0.96      0.99      0.97       855
         1.0       0.91      0.75      0.82       145

    accuracy                           0.95      1000
   macro avg       0.93      0.87      0.90      1000
weighted avg       0.95      0.95      0.95      1000

In [62]:
model =  AdaBoostClassifier(random_state=0)
param_grid = {
    'n_estimators': [100, 150, 200, 250]
}

grid_search = GridSearchCV(model, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train_tune, y_train_tune)
best_ada = grid_search.best_estimator_
best_ada
Out[62]:
AdaBoostClassifier(n_estimators=100, random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AdaBoostClassifier(n_estimators=100, random_state=0)
In [63]:
adaboosted_tree = AdaBoostClassifier(n_estimators=150, random_state=0)
adaboosted_tree.fit(X_train, y_train)
scores(adaboosted_tree)
Classification Report for test data:
              precision    recall  f1-score   support

         0.0       0.90      0.96      0.93       855
         1.0       0.61      0.37      0.46       145

    accuracy                           0.87      1000
   macro avg       0.75      0.66      0.69      1000
weighted avg       0.86      0.87      0.86      1000

In [64]:
model =  GBC(random_state=0)
param_grid = {
    'n_estimators': [100, 150, 200, 250]
}

grid_search = GridSearchCV(model, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train_tune, y_train_tune)
best_gbc = grid_search.best_estimator_
best_params = grid_search.best_params_
best_params
Out[64]:
{'n_estimators': 150}
In [65]:
gbc_boosted_tree = GBC(n_estimators=100, random_state=0)
gbc_boosted_tree.fit(X_train, y_train)
scores(gbc_boosted_tree)
Classification Report for test data:
              precision    recall  f1-score   support

         0.0       0.96      0.99      0.97       855
         1.0       0.91      0.75      0.82       145

    accuracy                           0.95      1000
   macro avg       0.93      0.87      0.90      1000
weighted avg       0.95      0.95      0.95      1000

b.6) Support vector machines¶

In [286]:
X_train = train_data.loc[:, train_data.columns != 'churn']
y_train = train_data['churn'].astype(int)

X_test = test_data.loc[:, test_data.columns != 'churn']
y_test = test_data['churn'].astype(int)

scaler_standard = StandardScaler()
numeric_features = ['accountlength', 'totaldayminutes', 'totaldaycalls', 'totaldaycharge',
                    'totaleveminutes', 'totalevecalls', 'totalevecharge', 'totalnightminutes', 'totalnightcalls',
                    'totalnightcharge', 'totalintlminutes', 'totalintlcharge']  

scaler_minmax = MinMaxScaler()
numeric_features_minmax = ['numbervmailmessages', 'totalintlcalls', 'numbercustomerservicecalls']  

X_numeric_train = X_train[numeric_features] 
X_numeric_train_scaled = scaler_standard.fit_transform(X_numeric_train)  
X_train_scaled = X_train.copy()
X_train_scaled[numeric_features] = X_numeric_train_scaled  

X_numeric_minmax = X_train_scaled[numeric_features_minmax]  
X_numeric_minmax_scaled = scaler_minmax.fit_transform(X_numeric_minmax)  
X_train_scaled[numeric_features_minmax] = X_numeric_minmax_scaled  

X_numeric_test = X_test[numeric_features] 
X_numeric_test_scaled = scaler_standard.transform(X_numeric_test)  
X_test_scaled = X_test.copy()
X_test_scaled[numeric_features] = X_numeric_test_scaled  

X_numeric_test_minmax = X_test_scaled[numeric_features_minmax]  
X_numeric_test_minmax_scaled = scaler_minmax.transform(X_numeric_test_minmax)  
X_test_scaled[numeric_features_minmax] = X_numeric_test_minmax_scaled  
In [287]:
# Linear Kernel

svm_model_l = SVC(kernel='linear', C=100)
svm_model_l.fit(X_train_scaled, y_train)

train_preds = svm_model_l.predict(X_train_scaled)
train_accuracy = accuracy_score(y_train, train_preds)
print(f"Accuracy on train data: {train_accuracy:.2f}")

print("Classification Report for train data:")
print(classification_report(y_train, train_preds, zero_division=0))

correct_predictions = (train_preds == y_train)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")

train_conf_matrix = confusion_matrix(y_train, train_preds)

plt.figure(figsize=(4, 4))
sns.heatmap(train_conf_matrix, annot=True, fmt='g', cmap='Blues', 
             square=True,
            xticklabels=['No', 'Yes'], 
            yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()

test_preds = svm_model_l.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, test_preds)
print(f"Accuracy on test data: {test_accuracy:.2f}")

print("Classification Report for test data:")
print(classification_report(y_test, test_preds, zero_division=0))

correct_predictions_test = (test_preds == y_test)
num_correct_test = sum(correct_predictions_test)
print(f"Number of correct predictions on test data: {num_correct_test}")

test_conf_matrix = confusion_matrix(y_test, test_preds)

plt.figure(figsize=(4, 4))
sns.heatmap(test_conf_matrix, annot=True, fmt='g', cmap='Blues', 
             square=True,
            xticklabels=['No', 'Yes'], 
            yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of SMV Linear Kernel')
plt.show()
Accuracy on train data: 0.86
Classification Report for train data:
              precision    recall  f1-score   support

           0       0.86      1.00      0.92      3438
           1       0.00      0.00      0.00       562

    accuracy                           0.86      4000
   macro avg       0.43      0.50      0.46      4000
weighted avg       0.74      0.86      0.79      4000

Number of correct predictions: 3438
No description has been provided for this image
Accuracy on test data: 0.85
Classification Report for test data:
              precision    recall  f1-score   support

           0       0.85      1.00      0.92       855
           1       0.00      0.00      0.00       145

    accuracy                           0.85      1000
   macro avg       0.43      0.50      0.46      1000
weighted avg       0.73      0.85      0.79      1000

Number of correct predictions on test data: 855
No description has been provided for this image

The linear kernel SVM model achieved similar accuracy between train (86%) and test data (85%). However, the Precision, Recall and f1-scores were 0% for both datasets for class churned, indicating that the model didn't correctly predict any instances of churned customers.

This might reflect the imbalance in the dataset, leading the model to focus more on the majority class, affecting its ability to predict the minority class accurately. Additionally, this may indicate the model ineffectiveness for churn prediction, as the model trained with a linear kernel might not capture the complex relationships associated with churn and alternative algorithms to enhance the model's ability to predict churn accurately should be useded.

In [288]:
svm_model_rbf = SVC(kernel='rbf', C=100)
svm_model_rbf.fit(X_train_scaled, y_train)

train_preds = svm_model_rbf.predict(X_train_scaled)
train_accuracy = accuracy_score(y_train, train_preds)
print(f"Accuracy on train data: {train_accuracy:.2f}")

print("Classification Report for train data:")
print(classification_report(y_train, train_preds, zero_division=0))

correct_predictions = (train_preds == y_train)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")

train_conf_matrix = confusion_matrix(y_train, train_preds)

plt.figure(figsize=(4, 4))
sns.heatmap(train_conf_matrix, annot=True, fmt='g', cmap='Blues', 
             square=True,
            xticklabels=['No', 'Yes'], 
            yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()

test_preds = svm_model_rbf.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, test_preds)
print(f"Accuracy on test data: {test_accuracy:.2f}")

print("Classification Report for test data:")
print(classification_report(y_test, test_preds, zero_division=0))

correct_predictions_test = (test_preds == y_test)
num_correct_test = sum(correct_predictions_test)
print(f"Number of correct predictions on test data: {num_correct_test}")

test_conf_matrix = confusion_matrix(y_test, test_preds)

plt.figure(figsize=(4, 4))
sns.heatmap(test_conf_matrix, annot=True, fmt='g', cmap='Blues', 
             square=True,
            xticklabels=['No', 'Yes'], 
            yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of SMV RBF Kernel')
plt.show()
Accuracy on train data: 0.99
Classification Report for train data:
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      3438
           1       1.00      0.94      0.97       562

    accuracy                           0.99      4000
   macro avg       0.99      0.97      0.98      4000
weighted avg       0.99      0.99      0.99      4000

Number of correct predictions: 3967
No description has been provided for this image
Accuracy on test data: 0.88
Classification Report for test data:
              precision    recall  f1-score   support

           0       0.93      0.93      0.93       855
           1       0.59      0.57      0.58       145

    accuracy                           0.88      1000
   macro avg       0.76      0.75      0.75      1000
weighted avg       0.88      0.88      0.88      1000

Number of correct predictions on test data: 880
No description has been provided for this image

The RBF kernel SVM model achieved an exceptional overall accuracy of 99%, presenting a high Precision (99% and 100% for not churned and churn, respectively), Recall (100% and 94% for not churned and churned, respectively), and f1-scores (100% and 97% for not churned and churn, respectively) for both classes.The model seems to effectively separate churned and not-churned instances, achieving near-perfect precision and recall scores.

Achieving such high accuracy on the training set might raise concerns about potential overfitting, being necessary to validate this model on a separate test dataset to assess its generalization ability.

However, when we tested the RBF kernel SVM model on the test dataset the accuracy dropped to 88%, with Precision, Recall and f1-scores dropping for the class churned, indicating that the model didn't correctly predict the minority class accurately.

b.7) Neural Network Classifier¶

To create a Neural Network Classifier, we started by creating a model with no defined parameters. We fit this model to our training data and used it to make predictions of our target variable (y_pred) using our test variables (X_test).

In [290]:
clf = MLPClassifier().fit(X_train, y_train)

y_pred = clf.predict(X_test)
In [291]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
    
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
Accuracy: 0.873
Precision: 0.6730769230769231
Recall: 0.2413793103448276
F1 Score: 0.3553299492385787

This model has an accuracy of 0.865, however the precision and recall of this model are low, which in turn gives us a low f1 score.

We created a hyperparameter grid to search for the best hidden layer sizes. We tested a few options of hidden layer sizes to demostrate how the search grid works, however, more options can be tested. The more options we test, the more likelyhood of getting a higher performing model, but it will take a longer time to run. The performance of the models will be evaluated using accuracy.

In [292]:
mlp_gs = MLPClassifier(max_iter=100)
parameter_space = {
    'hidden_layer_sizes': [(50,), (100,), (200,), (10, 20), (10, 50, 10)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}
from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(mlp_gs, parameter_space, n_jobs=-1, cv=5)
clf.fit(X_train, y_train) 
Out[292]:
GridSearchCV(cv=5, estimator=MLPClassifier(max_iter=100), n_jobs=-1,
             param_grid={'activation': ['tanh', 'relu'],
                         'alpha': [0.0001, 0.05],
                         'hidden_layer_sizes': [(50,), (100,), (200,), (10, 20),
                                                (10, 50, 10)],
                         'learning_rate': ['constant', 'adaptive'],
                         'solver': ['sgd', 'adam']})
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=5, estimator=MLPClassifier(max_iter=100), n_jobs=-1,
             param_grid={'activation': ['tanh', 'relu'],
                         'alpha': [0.0001, 0.05],
                         'hidden_layer_sizes': [(50,), (100,), (200,), (10, 20),
                                                (10, 50, 10)],
                         'learning_rate': ['constant', 'adaptive'],
                         'solver': ['sgd', 'adam']})
MLPClassifier(max_iter=100)
MLPClassifier(max_iter=100)
In [293]:
print('Best parameters found:\n', clf.best_params_)
Best parameters found:
 {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'adam'}
In [294]:
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
0.860 (+/-0.003) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.862 (+/-0.009) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'adam'}
0.861 (+/-0.009) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.866 (+/-0.006) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.862 (+/-0.005) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.865 (+/-0.006) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}
0.860 (+/-0.005) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.868 (+/-0.011) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.869 (+/-0.011) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.868 (+/-0.006) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'adam'}
0.869 (+/-0.005) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.871 (+/-0.012) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.860 (+/-0.002) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'sgd'}
0.860 (+/-0.002) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'adam'}
0.860 (+/-0.002) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.861 (+/-0.010) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.860 (+/-0.001) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'sgd'}
0.860 (+/-0.003) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'adam'}
0.859 (+/-0.002) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.861 (+/-0.012) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.861 (+/-0.005) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.866 (+/-0.010) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'adam'}
0.861 (+/-0.010) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.867 (+/-0.007) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.866 (+/-0.006) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.868 (+/-0.009) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}
0.863 (+/-0.011) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.868 (+/-0.019) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.866 (+/-0.003) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.864 (+/-0.005) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'adam'}
0.869 (+/-0.008) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.869 (+/-0.012) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.860 (+/-0.001) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'sgd'}
0.860 (+/-0.005) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'adam'}
0.859 (+/-0.001) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.860 (+/-0.001) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.860 (+/-0.003) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'sgd'}
0.862 (+/-0.006) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'adam'}
0.860 (+/-0.001) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.861 (+/-0.008) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.865 (+/-0.010) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.856 (+/-0.017) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'adam'}
0.863 (+/-0.008) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.864 (+/-0.014) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.864 (+/-0.010) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.863 (+/-0.007) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}
0.866 (+/-0.007) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.868 (+/-0.008) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.868 (+/-0.015) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.857 (+/-0.042) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'adam'}
0.869 (+/-0.014) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.867 (+/-0.012) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.860 (+/-0.001) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'sgd'}
0.865 (+/-0.010) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'adam'}
0.859 (+/-0.001) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.867 (+/-0.005) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.862 (+/-0.005) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'sgd'}
0.871 (+/-0.008) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'adam'}
0.860 (+/-0.002) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.862 (+/-0.015) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.867 (+/-0.010) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.854 (+/-0.018) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'adam'}
0.859 (+/-0.005) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.858 (+/-0.028) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.869 (+/-0.009) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.863 (+/-0.025) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}
0.867 (+/-0.008) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.855 (+/-0.032) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.865 (+/-0.015) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.866 (+/-0.015) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'adam'}
0.869 (+/-0.010) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.865 (+/-0.010) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.862 (+/-0.011) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'sgd'}
0.862 (+/-0.008) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'adam'}
0.863 (+/-0.015) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.865 (+/-0.013) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.865 (+/-0.011) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'sgd'}
0.866 (+/-0.004) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'adam'}
0.862 (+/-0.005) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.865 (+/-0.011) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'adam'}
In [299]:
clf = MLPClassifier(activation= 'relu', alpha = 0.0001, hidden_layer_sizes = (10, 50, 10), learning_rate = 'constant', solver = 'adam').fit(X_train, y_train)


y_pred = clf_b.predict(X_test)
In [300]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
    
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
Accuracy: 0.795
Precision: 0.38549618320610685
Recall: 0.696551724137931
F1 Score: 0.4963144963144963
In [301]:
print("Classification Report for X_train data:")
print(classification_report(y_test, y_pred, zero_division=0))

correct_predictions = (y_pred == y_test)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")
Classification Report for X_train data:
              precision    recall  f1-score   support

           0       0.94      0.81      0.87       855
           1       0.39      0.70      0.50       145

    accuracy                           0.80      1000
   macro avg       0.66      0.75      0.68      1000
weighted avg       0.86      0.80      0.82      1000

Number of correct predictions: 795
In [302]:
NNC_conf_matrix = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(4, 4))
sns.heatmap(NNC_conf_matrix, annot=True, fmt='g', cmap='Blues', 
            square=True,
            xticklabels=['No', 'Yes'], 
            yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of NNC')
plt.show()
No description has been provided for this image

5- Evaluation and Main Conclusions¶

What is your summary of the achieved results? The primary objective of this work was to develop a robust predictive model capable of anticipating customer churn within the telecommunications sector. To tackle the problem, we tested six models: Nearest neighbor, Bayesian Classifier, Decision Trees, Tree ensembles, Support Vector Machines, Neural Network Classifier, to predict customer churn. The results obtained in our work have shown that several of these models presented a high accuracy rate in predicting customer churn, with high precision and recall. Nonetheless, the best prediction model, when trained either with a balanced or unbalanced dataset, was the Tree Ensemble. With a high accuracy score of 95%, and a F1 score of 0,82 presented the best overall performance and was the best at predicting churn.

In [66]:
models = [best_dt_model_f1_b, best_dt_model_f1, best_bag_b, bag, best_rf_b, rf, best_ada_b, adaboosted_tree, best_gbc_b, gbc_boosted_tree]
models_names = ['Decision Tree', 'Decision Tree', 'Bagging', 'Bagging', 'Random Forest', 'Random Forest', 'Ada Booster', 'Ada Booster', 'Gradient Boosting', 'Gradient Boosting']
data = ["Balanced", "Unbalanced", "Balanced", "Unbalanced", "Balanced", "Unbalanced", "Balanced", "Unbalanced", "Balanced", "Unbalanced"]
accuracies = []
recall_scores = []
precision_scores = []
f1_scores = []

for model in models:
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    accuracies.append(accuracy)
    recall_scores.append(recall)
    precision_scores.append(precision)
    f1_scores.append(f1)

comp = {'Model': models_names, 'Data': data, 'Accuracy': accuracies, 'Recall Score': recall_scores, 'Precision Score': precision_scores, 'F1 Score': f1_scores}
comparation = pd.DataFrame(comp)

comparation
Out[66]:
Model Data Accuracy Recall Score Precision Score F1 Score
0 Decision Tree Balanced 0.910 0.820690 0.650273 0.725610
1 Decision Tree Unbalanced 0.941 0.668966 0.898148 0.766798
2 Bagging Balanced 0.940 0.834483 0.770701 0.801325
3 Bagging Unbalanced 0.949 0.731034 0.898305 0.806084
4 Random Forest Balanced 0.938 0.827586 0.764331 0.794702
5 Random Forest Unbalanced 0.953 0.751724 0.908333 0.822642
6 Ada Booster Balanced 0.851 0.648276 0.489583 0.557864
7 Ada Booster Unbalanced 0.874 0.365517 0.609195 0.456897
8 Gradient Boosting Balanced 0.939 0.820690 0.772727 0.795987
9 Gradient Boosting Unbalanced 0.953 0.751724 0.908333 0.822642
In [69]:
import pandas as pd

models = ['knn_b', 'knn', 'clf_b', 'clf', 'model_NB_b', 'model_NB', 'svm_model_lb', 'svm_model_l']
models_names = ['KNN', 'KNN', 'NeuralNetwork', 'NeuralNetwork', 'Bayes', 'Bayes','SVMLinear', 'SVMLinear']
data = ["Balanced", "Unbalanced", "Balanced", "Unbalanced","Balanced", "Unbalanced", "Balanced", "Unbalanced"]
accuracies = []
recall_scores = []
precision_scores = []
f1_scores = []

for model in models:   
    if model == 'knn_b':
        accuracies.append(0.80)
        recall_scores.append(0.80)
        precision_scores.append(0.87)
        f1_scores.append(0.82)
    if model == 'knn':
        accuracies.append(0.89)
        recall_scores.append(0.42)
        precision_scores.append(0.77)
        f1_scores.append(0.54)
    if model == 'clf_b':
        accuracies.append(0.80)
        recall_scores.append(0.86)
        precision_scores.append(0.80)
        f1_scores.append(0.82)
    if model == 'clf':
        accuracies.append(0.80)
        recall_scores.append(0.80)
        precision_scores.append(0.86)
        f1_scores.append(0.82)
    if model == 'model_NB_b':
        accuracies.append(0.78)
        recall_scores.append(0.79)
        precision_scores.append(0.78)
        f1_scores.append(0.78)
    elif model == 'model_NB':
        accuracies.append(0.87)
        recall_scores.append(0.87)
        precision_scores.append(0.87 )
        f1_scores.append(0.87)
    elif model == 'svm_model_lb':
        accuracies.append(0.80)
        recall_scores.append(0.80)
        precision_scores.append(0.80)
        f1_scores.append(0.80)
    elif model == "svm_model_l":
        accuracies.append(0.85)
        recall_scores.append(0.85)
        precision_scores.append(0.73)
        f1_scores.append(0.79)

comp = {'Model': models_names, 'Data': data, 'Accuracy': accuracies, 'Recall Score': recall_scores, 'Precision Score': precision_scores, 'F1 Score': f1_scores}
comparison = pd.DataFrame(comp)

comparison
Out[69]:
Model Data Accuracy Recall Score Precision Score F1 Score
0 KNN Balanced 0.80 0.80 0.87 0.82
1 KNN Unbalanced 0.89 0.42 0.77 0.54
2 NeuralNetwork Balanced 0.80 0.86 0.80 0.82
3 NeuralNetwork Unbalanced 0.80 0.80 0.86 0.82
4 Bayes Balanced 0.78 0.79 0.78 0.78
5 Bayes Unbalanced 0.87 0.87 0.87 0.87
6 SVMLinear Balanced 0.80 0.80 0.80 0.80
7 SVMLinear Unbalanced 0.85 0.85 0.73 0.79

In order to see if the combination of the predictions of the different models would improve the results, we created two Hard Voting models (balanced and unbalanced data), aggregating the predictions of best performing models. We did this for some of the models of the balanced and unbalanced data, but the accuracy scores were very similar to the Gradient Boosting Tree Ensemble, our best model, and had lower F1 scores.

In [70]:
voting_model_balanced = VotingClassifier(estimators=[
    ('best_bag_b', best_bag_b),
    ('best_rf_b', best_rf_b),
    ('best_gbc_b', best_gbc_b)
], voting='hard')

voting_model_balanced.fit(X_train_balanced, y_train_balanced)

y_pred_balanced = voting_model_balanced.predict(X_test)

accuracy_balanced = accuracy_score(y_test, y_pred_balanced)
recall_balanced = recall_score(y_test, y_pred_balanced)
precision_balanced = precision_score(y_test, y_pred_balanced)
f1_balanced = f1_score(y_test, y_pred_balanced)

voting_model_unbalanced = VotingClassifier(estimators=[
    ('bag', bag),
    ('rf', rf),
    ('gbc_boosted_tree', gbc_boosted_tree)
], voting='hard')

voting_model_unbalanced.fit(X_train, y_train)

y_pred_unbalanced = voting_model_unbalanced.predict(X_test)

accuracy_unbalanced = accuracy_score(y_test, y_pred_unbalanced)
recall_unbalanced = recall_score(y_test, y_pred_unbalanced)
precision_unbalanced = precision_score(y_test, y_pred_unbalanced)
f1_unbalanced = f1_score(y_test, y_pred_unbalanced)

voting_comp = {
    'Data': ['Balanced', 'Unbalanced'],
    'Accuracy': [accuracy_balanced, accuracy_unbalanced],
    'Recall Score': [recall_balanced, recall_unbalanced],
    'Precision Score': [precision_balanced, precision_unbalanced],
    'F1 Score': [f1_balanced, f1_unbalanced]
}

voting_comparison = pd.DataFrame(voting_comp)

display(voting_comparison)
Data Accuracy Recall Score Precision Score F1 Score
0 Balanced 0.944 0.827586 0.794702 0.810811
1 Unbalanced 0.956 0.751724 0.931624 0.832061

In order to know if there is any particular characteristic on the data that makes them difficult to classify we have to identify cases that are recurrently mis-classified by most methods:

In [72]:
models = [best_dt_model_f1_b, best_dt_model_f1, best_bag_b, bag, best_rf_b, rf, best_ada_b, adaboosted_tree, best_gbc_b, gbc_boosted_tree]
predictions = pd.DataFrame({'True_Labels': y_test}, index=X_test.index)

for model in models:
    model_name = str(model).split('(')[0]
    predictions[model_name] = model.predict(X_test)

# Create a DataFrame for misclassified cases with the correct index
misclassified_cases = pd.DataFrame({'True_Labels': y_test}, index=X_test.index)

for model in models:
    model_name = str(model).split('(')[0]
    misclassified_cases[model_name] = (predictions['True_Labels'] != predictions[model_name]).astype(int)

misclassified_cases['Total_Misclassifications'] = misclassified_cases.iloc[:, 1:].sum(axis=1)

most_missclassified_cases = misclassified_cases[misclassified_cases['Total_Misclassifications'] > 2]

# Use boolean indexing to filter rows in X_test
most_missclassified_features = X_test.loc[most_missclassified_cases.index]
In [73]:
miss_features = test_data.loc[most_missclassified_cases.index]
miss_features
Out[73]:
churn accountlength internationalplan voicemailplan numbervmailmessages totaldayminutes totaldaycalls totaldaycharge totaleveminutes totalevecalls totalevecharge totalnightminutes totalnightcalls totalnightcharge totalintlminutes totalintlcalls totalintlcharge numbercustomerservicecalls
57 1.0 121.0 0.0 1.0 30.0 198.4 129.0 33.730000 75.30 77.0 6.40 181.2 77.0 8.150 5.80 3.0 1.570000 3.0
951 0.0 101.0 0.0 0.0 0.0 153.8 89.0 26.150000 234.00 89.0 19.89 196.3 77.0 8.830 11.60 2.0 3.130000 4.0
2369 1.0 112.0 0.0 0.0 0.0 174.3 123.0 29.630000 191.38 124.0 11.92 215.4 89.0 9.690 9.00 6.0 2.430000 4.0
3256 0.0 115.0 0.0 0.0 0.0 268.0 115.0 45.560000 153.60 106.0 13.06 232.3 65.0 10.450 17.00 1.0 4.590000 3.0
1165 0.0 50.0 1.0 1.0 26.0 307.1 94.0 52.210000 289.40 78.0 24.60 174.9 109.0 7.870 8.00 3.0 2.160000 0.0
2731 1.0 127.0 0.0 0.0 0.0 245.2 91.0 41.680000 217.20 92.0 18.46 243.1 128.0 10.940 13.90 6.0 3.750000 0.0
2745 1.0 61.0 0.0 1.0 40.0 105.0 78.0 17.850000 180.60 100.0 15.35 174.1 115.0 7.830 10.20 2.0 2.750000 2.0
3604 1.0 60.0 0.0 0.0 0.0 221.0 94.0 37.570000 308.20 72.0 26.20 233.2 94.0 10.490 10.10 5.0 2.730000 1.0
3070 1.0 154.0 0.0 0.0 0.0 154.5 122.0 26.270000 214.20 71.0 18.21 178.0 105.0 8.010 12.00 2.0 3.240000 3.0
437 1.0 100.0 0.0 0.0 0.0 278.0 76.0 47.260000 176.70 74.0 15.02 219.5 126.0 9.880 8.30 4.0 2.240000 0.0
1965 1.0 139.0 0.0 0.0 0.0 236.6 109.0 40.220000 169.90 107.0 14.44 212.3 118.0 9.550 11.10 2.0 3.000000 1.0
3873 1.0 63.0 0.0 0.0 0.0 162.5 86.0 27.630000 165.40 112.0 14.06 163.7 67.0 7.370 13.40 6.0 3.620000 4.0
2961 1.0 98.0 0.0 1.0 36.0 168.0 81.0 28.560000 163.20 125.0 13.87 172.7 120.0 7.770 8.00 2.0 2.160000 6.0
1551 1.0 225.0 0.0 0.0 0.0 165.4 106.0 28.120000 273.70 109.0 23.26 210.0 93.0 9.450 8.70 3.0 2.350000 0.0
2377 1.0 101.0 0.0 1.0 36.0 123.7 125.0 21.030000 172.60 106.0 14.67 280.5 127.0 12.620 8.80 4.0 2.380000 1.0
4153 1.0 106.0 1.0 0.0 0.0 196.7 100.0 33.440000 210.20 109.0 17.87 173.8 83.0 7.820 7.98 1.0 2.110000 4.0
3509 0.0 59.0 0.0 1.0 26.0 205.4 68.0 34.920000 115.00 115.0 9.78 214.7 130.0 9.660 9.40 2.0 2.540000 5.0
2820 0.0 74.0 0.0 1.0 27.0 154.1 122.0 26.200000 195.30 150.0 16.60 276.7 86.0 12.450 13.20 2.0 3.560000 4.0
1018 1.0 76.0 0.0 0.0 0.0 263.4 148.0 37.387971 230.30 69.0 19.58 170.6 101.0 7.680 11.40 5.0 3.080000 1.0
1904 1.0 90.8 0.0 1.0 33.0 167.8 91.0 28.530000 205.30 91.0 17.45 130.0 132.0 5.850 14.50 4.0 3.920000 4.0
1192 1.0 88.0 1.0 0.0 0.0 235.1 98.0 39.970000 251.80 79.0 21.40 285.9 76.0 12.870 7.20 2.0 1.940000 4.0
4980 1.0 73.0 0.0 0.0 0.0 177.2 86.6 30.120000 270.50 84.0 22.99 241.8 112.0 10.880 12.30 2.0 3.320000 3.0
1105 1.0 135.0 0.0 1.0 28.0 201.4 100.0 34.240000 246.50 117.0 20.95 154.8 131.0 6.970 12.90 4.0 3.480000 2.0
4421 1.0 67.0 0.0 0.0 0.0 177.4 102.0 30.160000 215.20 123.0 18.29 132.4 120.0 5.960 6.60 4.0 1.780000 4.0
69 1.0 150.0 0.0 0.0 0.0 178.9 101.0 30.410000 169.10 110.0 14.37 148.6 100.0 6.690 13.80 3.0 3.730000 4.0
2724 1.0 182.0 0.0 0.0 0.0 279.1 124.0 47.450000 180.50 108.0 15.34 217.5 104.0 9.790 9.50 11.0 2.570000 2.0
736 1.0 68.0 0.0 0.0 0.0 162.1 86.0 27.560000 155.00 86.0 13.18 189.7 87.0 8.540 11.00 9.0 2.970000 5.0
1240 1.0 108.0 0.0 1.0 34.0 162.1 83.0 27.560000 171.80 117.0 14.60 259.8 76.0 11.690 9.60 3.0 2.590000 4.0
3991 1.0 123.0 0.0 0.0 0.0 132.2 122.0 22.470000 169.90 101.0 14.44 150.1 123.0 6.750 12.90 11.0 3.480000 3.0
2785 1.0 38.0 0.0 0.0 0.0 175.7 109.0 29.870000 211.80 97.0 18.00 137.9 109.0 6.210 9.20 3.0 2.624881 5.0
2950 1.0 133.0 2.0 0.0 0.0 117.8 100.0 20.030000 199.20 105.0 16.93 244.1 119.0 10.980 11.80 4.0 3.190000 0.0
4137 1.0 76.0 0.0 0.0 0.0 222.0 99.0 37.740000 211.22 122.0 18.35 204.1 131.0 9.180 8.20 5.0 2.210000 3.0
2934 1.0 24.0 0.0 0.0 0.0 149.0 73.0 25.330000 131.00 81.0 11.14 238.6 69.0 10.606 8.60 3.0 2.320000 2.0
4406 1.0 116.0 1.0 0.0 0.0 232.2 98.0 39.470000 244.70 91.0 20.80 151.3 69.0 6.810 10.60 1.0 2.860000 7.0
1115 1.0 98.0 0.0 0.0 0.0 165.0 129.0 28.050000 202.60 113.0 17.22 172.3 94.0 7.750 12.50 7.0 3.380000 1.0
3369 1.0 73.0 0.0 0.0 0.0 137.6 113.0 23.390000 179.70 75.0 15.27 172.8 71.0 7.780 10.70 2.0 2.890000 1.0
1262 0.0 157.0 0.0 0.0 0.0 220.7 105.0 37.520000 119.30 127.0 10.14 165.1 113.0 7.430 11.50 7.0 3.110000 4.0
4624 1.0 178.0 0.0 0.0 0.0 225.4 106.0 38.320000 110.70 90.0 9.41 134.1 109.0 6.030 15.30 5.0 4.130000 2.0
1532 1.0 103.0 0.0 1.0 18.0 149.9 84.0 25.480000 170.90 84.0 14.53 171.5 112.0 7.720 11.50 7.0 3.110000 0.0
3147 0.0 62.0 0.0 0.0 0.0 245.3 91.0 41.700000 122.90 130.0 10.45 228.4 102.0 10.280 8.50 4.0 2.300000 4.0
2472 1.0 129.0 0.0 0.0 0.0 137.8 120.0 23.430000 225.80 110.0 19.19 145.2 95.0 6.530 10.20 6.0 2.750000 1.0
4494 0.0 137.0 0.0 0.0 0.0 248.3 63.0 42.210000 250.40 108.0 21.28 159.1 94.0 7.160 9.60 4.0 2.590000 1.0
2238 1.0 61.0 0.0 0.0 0.0 267.1 104.0 45.410000 180.40 131.0 15.33 230.6 106.0 10.380 12.10 4.0 4.670000 1.0
974 0.0 81.4 0.0 1.0 31.0 135.9 90.0 23.100000 271.00 84.0 23.04 179.1 89.0 8.060 9.50 7.0 2.570000 6.0
3601 1.0 166.0 0.0 0.0 0.0 156.8 100.0 26.660000 171.50 93.0 14.58 149.1 97.0 6.710 11.20 10.0 3.020000 1.0
91 1.0 155.0 0.0 0.0 0.0 203.4 100.0 34.580000 190.90 104.0 16.23 196.0 119.0 8.820 8.90 4.0 2.400000 0.0
2535 1.0 125.0 0.0 0.0 0.0 113.0 108.0 19.210000 169.20 107.0 14.38 156.6 61.0 7.050 9.20 5.0 2.480000 2.0
1516 1.0 68.0 0.0 0.0 0.0 213.9 112.0 36.360000 260.50 100.0 22.14 233.8 97.0 10.520 8.40 3.0 2.270000 1.0
In [74]:
miss_features.describe()
Out[74]:
churn accountlength internationalplan voicemailplan numbervmailmessages totaldayminutes totaldaycalls totaldaycharge totaleveminutes totalevecalls totalevecharge totalnightminutes totalnightcalls totalnightcharge totalintlminutes totalintlcalls totalintlcharge numbercustomerservicecalls
count 48.000000 48.000000 48.000000 48.000000 48.000000 48.000000 48.000000 48.000000 48.000000 48.000000 48.000000 48.00000 48.000000 48.000000 48.00000 48.000000 48.000000 48.000000
mean 0.812500 104.316667 0.125000 0.250000 7.604167 191.066667 101.116667 32.328083 195.739583 101.104167 16.555417 193.81875 100.562500 8.718875 10.53500 4.270833 2.875935 2.562500
std 0.394443 41.501394 0.392754 0.437595 13.614353 49.643278 17.621739 8.275923 49.482537 18.585329 4.263230 41.56264 20.267263 1.867009 2.34517 2.473172 0.683270 1.855484
min 0.000000 24.000000 0.000000 0.000000 0.000000 105.000000 63.000000 17.850000 75.300000 69.000000 6.400000 130.00000 61.000000 5.850000 5.80000 1.000000 1.570000 0.000000
25% 1.000000 71.750000 0.000000 0.000000 0.000000 154.400000 89.750000 26.252500 169.725000 85.500000 14.377500 162.55000 86.750000 7.317500 8.77500 2.000000 2.372500 1.000000
50% 1.000000 101.000000 0.000000 0.000000 0.000000 177.300000 100.000000 30.140000 191.140000 104.500000 15.790000 180.15000 101.500000 8.105000 10.20000 4.000000 2.750000 2.500000
75% 1.000000 130.000000 0.000000 0.250000 4.500000 227.100000 112.250000 37.885000 226.925000 112.250000 19.287500 228.95000 118.250000 10.305000 12.02500 5.250000 3.260000 4.000000
max 1.000000 225.000000 2.000000 1.000000 40.000000 307.100000 148.000000 52.210000 308.200000 150.000000 26.200000 285.90000 132.000000 12.870000 17.00000 11.000000 4.670000 7.000000
In [75]:
variables = ['churn', 'internationalplan', 'voicemailplan']
fig, ax = plt.subplots(1, 3, figsize=(15, 5))

titles = ['Churn', 'International Plan', 'Voicemail Plan']

for i, variable in enumerate(variables):
    miss_features[variable].value_counts().plot(kind='bar', ax=ax[i])
    ax[i].set_title(titles[i]) 

plt.tight_layout()
No description has been provided for this image
In [76]:
fig, ax = plt.subplots(3, 5, figsize=(15, 10)) 

variables = ['accountlength', 'numbervmailmessages', 'totaldayminutes', 'totaldaycalls', 'totaldaycharge',
             'totaleveminutes', 'totalevecalls', 'totalevecharge', 'totalnightminutes', 'totalnightcalls',
             'totalnightcharge', 'totalintlminutes', 'totalintlcalls', 'totalintlcharge', 'numbercustomerservicecalls']

ax = ax.flatten()

for i, variable in enumerate(variables):
    miss_features[variable].plot.density(ax=ax[i], legend=True)
    ax[i].set_title(variable)

plt.tight_layout()
plt.show()
No description has been provided for this image

"From the vairables disturbution and discrption the main difference to the original dataset is on the variable "Churn" where, unlike the original disturbution, we got more cases in the class churn = Yes. This could happen for two reasons: the customers who are going to cease their subscription are indeed more difficult to predict because of natural reasons (more impulsive etc.) or becuase is the class we got less data so it´s harder for the models to extract knlowlegde and generelize roboust predictions from it. "

Final Recomendations¶

The best prediction model, when trained either with a balanced or unbalanced dataset, was the Gradient Boosting Tree Ensemble w With a high accuracy score of 95%, and a F1 score of 0,82 has the best overall performance and the best at predicting churn

Overall, the Machine Learning criteria was successfully achieved, since several of the created models have a high accuracy rate in predicting customer churn, with high precision and recall. The business success criteria can only be evaluated by using the model to implement changes to reduce costs by increasing customer retention.